People Counting Using Detection And Tracking
Techniques For Smart Video Surveillance
Ha Thi Oanh
Hanoi University of Science and Technology
Supervisor
Assoc. Prof. Tran Thi Thanh Hai
In partial fulfillment of the requirements for the degree of
Master of Computer Science
April 20, 2023
Acknowledgements
First of all, I would like to express my gratitude to my primary advisor,
Assoc. Prof. Tran Thi Thanh Hai, who guided me throughout this project.
I would like to thank Assoc. Prof. Le Thi Lan and Assoc. Prof. Vu Hai
for giving me deep insights, valuable recommendations, and brilliant ideas.
I am grateful for my time spent at MICA International Research Institute,
where I learnt a lot about research and enjoyed a very warm and friendly
working atmosphere. In particular, I wish to extend my special thanks to
Dr. Doan Thi Huong Giang who directly supported me.
This master's thesis was carried out within the framework of the ministerial-level
scientific research project "Research and development of an automatic system for
assessing learning activities in class based on image processing technology
and artificial intelligence", code CT2020.02.BKA.02, led by Assoc. Prof. Dr.
Le Thi Lan. I sincerely thank the project for its support.
Finally, I wish to show my appreciation to all my friends and family members
who helped me finalize the project.
Abstract
Real-time people counting from videos or images has multiple applications,
such as intelligent transportation, density estimation, and class management.
Although this problem has been widely studied, it still faces major challenges
due to crowded scenes and occlusion. In a common approach, this problem is
addressed by detecting people using conventional detectors. However, this
approach can fail when people stay in various postures or are occluded by each
other. We notice that even when a main part of the human body is occluded, the
face and head often remain observable. In addition, a person who is missed at
one frame may still be detected at the previous or the next frames.
In this thesis, we attempt to improve people counting results based on these
observations. We first deploy two detectors (Yolo and RetinaFace) to detect the
heads and faces of people in the scene. We then develop a pairing technique that
aligns the face and the head of each person. This alignment helps to recover
missed detections of heads or faces and thus increases the true positive rate.
To overcome the missed detection of both face and head at a certain frame, we
apply a tracking technique (i.e., SORT) to the combined detection results.
Putting all of these techniques into a unified framework increases the true
positive rate from 90.36% to 96.21% on the ClassHead Part 2 dataset.
Contents
List of Acronyms x
1 Introduction 1
1.1 Introduction to people counting . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Scientific and practical significance . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Scientific significance . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Practical significance . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Challenges and Motivation . . . . . . . . . . . . . . . . . . . . . 4
1.3 Objectives and Contributions . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Related works 9
2.1 Detection based people counting . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Face detection based people counting . . . . . . . . . . . . . . . 9
2.1.2 Head detection based people counting . . . . . . . . . . . . . 11
2.1.3 Hybrid detection based people counting . . . . . . . . . . . . 13
2.2 Density estimation based people counting . . . . . . . . . . . . . . . . . 15
2.3 People tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Overview of object tracking . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Multiple Object Tracking . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Tracking techniques . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3.1 Kalman filter . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3.2 SORT . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3.3 DeepSORT . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.4 Tracking-based people counting . . . . . . . . . . . . . . . . . . 26
2.4 Conclusion of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Proposed method for people counting 30
3.1 The proposed people counting framework . . . . . . . . . . . . . . . . . 30
3.2 Yolo-based head detection . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Yolo revisit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 Yolov5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.3 Implementation of Yolov5 for head detection . . . . . . . . . . . 38
3.3 RetinaFace based face detection . . . . . . . . . . . . . . . . . . . . . . 39
3.3.1 RetinaFace architecture . . . . . . . . . . . . . . . . . . . . . . 40
3.3.2 Implementation of RetinaFace for face detection . . . . . . . . . 43
3.4 Combination of head and face detection . . . . . . . . . . . . . . . . . . 44
3.4.1 Linear sum assignment problem . . . . . . . . . . . . . . . . . . 44
3.4.2 Head-face pairing cost . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Person tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4 Experiments 50
4.1 Dataset and Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . 50
4.1.1 Our collected dataset: ClassHead . . . . . . . . . . . . . . . . . 50
4.1.1.1 ClassHead Part 1 . . . . . . . . . . . . . . . . . . . . . 53
4.1.1.2 ClassHead Part 2 . . . . . . . . . . . . . . . . . . . . . 55
4.1.2 Hollywood Heads dataset . . . . . . . . . . . . . . . . . . . . . 55
4.1.3 Casablanca dataset . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.4 Wider Face dataset . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.5 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.5.1 Intersection over Union (IoU) . . . . . . . . . . . . . . 59
4.1.5.2 Precision and Recall . . . . . . . . . . . . . . . . . . . 59
4.1.5.3 F1-score . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.5.4 AP and mAP . . . . . . . . . . . . . . . . . . . . . . 61
4.1.5.5 Mean Absolute Error . . . . . . . . . . . . . . . . . . . 61
4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.1 Evaluation on Hollywood dataset . . . . . . . . . . . . . . . . . 62
4.2.2 Evaluation on Casablanca dataset . . . . . . . . . . . . . . . . . 63
4.2.3 Evaluation on Wider Face dataset . . . . . . . . . . . . . . . . . 66
4.2.4 Evaluation on ClassHead Part 2 dataset . . . . . . . . . . . . . 66
5 Conclusions 72
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
References 81
List of Figures
1.1 Illustration of the input and output of people counting from an image. 2
1.2 Some challenges in crowd counting [1]. . . . . . . . . . . . . . . . . . . 5
2.1 Framework for people counting based on face detection and tracking
in a video [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 System framework for depth-assisted face detection and association for
people counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 System framework for a people counting method based on head detection
and tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Network structure of Double Anchor R-CNN . . . . . . . . . . . . . . . 13
2.5 Architecture of JointDet . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Examples of people density estimation . . . . . . . . . . . . . . . . . . 16
2.7 Example of Multiple Object Tracking . . . . . . . . . . . . . . . . . . . 19
2.8 Hungarian Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.9 The tracking process of the SORT algorithm. . . . . . . . . . . . . . . . 25
2.10 Architecture of the proposed people counting and tracking system . . . 27
2.11 Flow architecture of the proposed smart surveillance system . . . . . . 28
3.1 The proposed framework for people counting by pairing head and face
detection and tracking. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Output of Yolo network[3]. . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Yolov5 architecture[4]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Spatial Pyramid Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Path Aggregation Network . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.6 Automatic learning of bounding box anchors [4] . . . . . . . . . . . . . 38
3.7 Activation functions used in Yolov5. (a) SiLU function. (b) Sigmoid
function [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Example for creating dataset.yaml. . . . . . . . . . . . . . . . . . . . . 40
3.9 An overview of the single-stage dense face localisation approach. Reti-
naFace is designed based on the feature pyramids with independent con-
text modules. Following the context modules, we calculate a multi-task
loss for each anchor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10 Organize dataset for Yolo training. . . . . . . . . . . . . . . . . . . . . 43
3.11 Example of RetinaFace testing on Wider Face dataset. . . . . . . . . . 44
3.12 Flowchart of combining object detection and tracking to improve the
true positive rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 Camera layout in the simulated classroom and an image obtained from
each camera view. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Illustration of LabelMe interface and main operations to annotate an
image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Illustration of images taken from five camera views in ClassHead Part 1
dataset: (a) View 1 , (b) View 2, (c) View 3, (d) View 4 and (e) View 5. 54
4.4 Some example images of ClassHead Part 2 dataset: view ch03 (a), view
ch04 (b), view ch05 (c), view ch12 (d), and view ch13 (e). . . . . . . . 56
4.5 Some example images of Hollywood Heads dataset (first row), Casablanca
dataset (second row), Wider Face dataset (third row), and ClassHead
Part 2 of our dataset (last row). . . . . . . . . . . . . . . . . . . . . . . 58
4.6 Calculating IOU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.7 Precision and Recall metrics . . . . . . . . . . . . . . . . . . . . . . . . 60
4.8 MAE measurement results on 2 proposed methods in Hollywood Heads
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.9 Results of Hollywood Heads dataset. (a) Results of head detection; (b)
Results of face detection; (c) Matching head and face detection using the
Hungarian algorithm. Heads are denoted with green, faces are yellow,
missed ground truths are red, and head-face pairings are cyan. . . . . . 64
4.10 MAE measurement results on 2 proposed methods in Casablanca Heads
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.11 Results of Casablanca dataset. (a) Results of head detection; (b) Re-
sults of face detection; (c) Matching head and face detection using the
Hungarian algorithm. Heads are denoted with green, faces are yellow,
missed ground truths are red, and head-face pairings are cyan. . . . . . 65
4.12 Results of Wider Face dataset. (a) Results of head detection; (b) Re-
sults of face detection; (c) Matching head and face detection using the
Hungarian algorithm. Heads are denoted with green, faces are yellow,
missed ground truths are red, and head-face pairings are cyan. . . . . . 67
4.13 MultiDetect results in ClassHead Part 2. (a) Head detections, (b) Face
detections, (c) MultiDetect . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.14 Head tracking method results in ClassHead Part 2 dataset. (a) Head
detections at frame 1, (b) Head tracking at frame 100. . . . . . . . . . 69
4.15 MultiDetect with Track method results in ClassHead Part 2 dataset. (a)
MultiDetect with Track at frame 1, (b) MultiDetect with Track at frame
100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.16 MAE measurement results on 3 proposed methods in ClassHead Part 2
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
List of Tables
4.1 Camera setup parameters for data collection. . . . . . . . . . . . . . . . 51
4.2 ClassHead Part 1 dataset for training and testing the Yolov5 head detector 55
4.3 ClassHead Part 2 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Results of the proposed method on the Hollywood Heads dataset. . . . 63
4.5 Results of the proposed method on the Casablanca dataset. . . . . . . . 64
4.6 Results of the proposed method on Wider Face dataset. . . . . . . . . . 66
4.7 Results of the head detection method on the ClassHead Part 2
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.8 Results of the MultiDetect method on the ClassHead Part 2 dataset . . . 68
4.9 Results of the Head Tracking in ClassHead Part 2 dataset. . . . . . . . 69
4.10 Results of the MultiDetect with Track method on the ClassHead Part 2 dataset. 70
4.11 Experimental results on the ClassHead Part 2 dataset with the four
methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
List of Acronyms
CNN Convolutional Neural Network.
HOG Histogram of Oriented Gradients.
LSTM Long short-term memory.
NN Neural Network.
RPN Region Proposal Network.
YOLO You Only Look Once.
Chapter 1
Introduction
Recently, people counting in images or videos has become an active research topic due
to its wide range of applications, from public safety to intelligent crowd flow management. Manual
counting is impractical since it is a tedious and time-consuming task, particularly
in crowded scenes. This chapter aims to define the problem of people counting, its
challenges, and provide discussions on the drawbacks of existing methods to motivate
our work. We then clarify our objectives and contributions to this field. Finally, we
describe the organization of the thesis.
1.1 Introduction to people counting
People counting in crowds refers to the process of accurately counting the number of
individuals present in a densely populated area or space. This is a challenging task due
to the high density of people, occlusions, overlapping individuals, and the need to track
people as they move through the crowd. People counting has been extensively studied
in recent years, and it has numerous real-life applications, including event management,
public safety, and transportation. For instance, it can be utilized to monitor crowd
density and prevent overcrowding in public spaces, optimize and improve security at
events and transportation hubs, etc.
To capture people in crowds, some sensors such as thermal imaging cameras, RGB
cameras, and lasers may be used. RGB cameras are the most commonly utilized due
to their low cost and popularity in almost all public spaces. From the captured data, computer
vision techniques such as object detection and tracking, optical flow, and background
subtraction can identify and track individuals in the crowd. The problem of people
counting from an image is defined as follows:
Input: An image or a frame from a video sequence.
Output: The number of people and/or their locations in the frame/image.
Figure 1.1 depicts the input and output of a people counting algorithm. The algorithm
outputs the number of people detected and determines the bounding box of each
individual. Depending on the context, location data may be crucial for further
processing. However, in highly crowded scenes, obtaining an exact count of people may be
impractical, and an estimation of the number of individuals is sufficient. In the next
chapter, we will review some related works that provide an estimation of the people
count with or without location and bounding box information.
Figure 1.1: Illustration of the input and output of people counting from an image.
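As a concrete illustration of this input/output contract, detection-based counting reduces to thresholding a detector's candidate boxes and counting the survivors. The sketch below is illustrative only: the `count_people` helper, the box format, and the 0.5 threshold are assumptions made for the example, not details of any particular detector used in this thesis.

```python
def count_people(detections, conf_threshold=0.5):
    """Count people from raw detector output.

    `detections` is a list of (box, score) pairs, where `box` is
    (x1, y1, x2, y2) in pixels and `score` is the detector's
    confidence. Returns the count and the kept bounding boxes.
    """
    kept = [box for box, score in detections if score >= conf_threshold]
    return len(kept), kept


# Toy example: three candidate boxes, one below the confidence threshold.
raw = [((10, 10, 50, 60), 0.92),
       ((70, 15, 110, 70), 0.81),
       ((120, 20, 160, 75), 0.30)]
count, boxes = count_people(raw)
print(count)  # 2
```

In practice, the threshold trades missed detections against false positives, which is exactly the tension the later chapters address with pairing and tracking.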
1.2 Scientific and practical significance
1.2.1 Scientific significance
People counting in crowds of humans has several scientific implications, including:
Crowd dynamics: People counting in crowds provides important data for studying
crowd dynamics, such as how people move, how they interact with each other, and
how they respond to changes in the environment. This information can be used to
develop mathematical models of crowd behavior and improve our understanding
of crowd dynamics.
Social behavior: People counting in crowds can also provide insights into social
behavior. It can help researchers understand how people interact with each other
in crowded environments, such as how they form groups, how they communicate,
and how they coordinate their movements.
Computer vision and machine learning: People counting in crowds provides an
important application for developing and evaluating computer vision and machine
learning algorithms. It helps to advance the state-of-the-art in object detection,
tracking, segmentation, and classification, which are essential for people counting
in crowded environments.
Sensor technology: People counting in crowds also drives the development of new
sensor technologies, such as cameras, depth sensors, and thermal sensors, that
are designed to capture data in crowded environments. This helps to advance the
field of sensor technology and improve our ability to capture data in challenging
environments.
Human-computer interaction: People counting in crowds can also provide in-
sights into human-computer interaction, particularly in the context of intelligent
systems. It helps to understand how people interact with technology in crowded
environments and how technology can be designed to support people in these
settings.
1.2.2 Practical significance
The people counting problem enables many practical applications. Some typical
applications are described below.
Crowd management: People counting in crowds is essential for managing and
controlling large crowds, particularly during events or in public spaces. It can help
organizers identify high-density areas and take action to prevent overcrowding,
which is critical for public safety.
Retail and marketing: People counting is an essential tool for retailers to optimize
staffing levels, measure customer traffic, and improve customer service. It helps
retailers identify high-traffic areas and monitor customer behavior, such as the
time spent in specific sections or the frequency of return visits.
Public safety and security: People counting is also an important tool for public
safety and security, helping to monitor crowd density and prevent overcrowding
in public spaces, as well as optimize staffing levels and improve security at events
and transportation hubs. It can also assist in tracking and identifying suspects
in security footage.
Transportation: People counting is useful in transportation systems to measure
the usage of different modes of transportation and optimize public transporta-
tion routes and schedules. It helps to reduce congestion, improve efficiency, and
enhance the overall user experience.
Education: Counting students in a classroom environment is an important and
reliable source of information for improving the quality of education by adapting
the content and teaching methods.
1.2.3 Challenges and Motivation
To solve the people counting problem, there exist a number of approaches that have
achieved impressive accuracy. However, this problem still faces many challenges, as
follows:
Occlusion: As crowd density increases, individuals may start to occlude each
other, which poses a challenge for traditional detection algorithms and motivates
the development of density estimation models.
Figure 1.2: Some challenges in crowd counting [1].
Complex background: In a natural scene, the background may be highly cluttered
and contain objects with similar appearances or colors to the foreground, which
can cause confusion.
Scale variation: One of the primary problems that should be addressed in the
density estimation models is the strong variation in the scales of objects. As
a result, most existing density estimation models are designed to address the
scale variation problem as a first step.
Camera viewpoint: Rotation variation increases drastically with the camera
viewpoint, such as different poses and camera angles.
Illumination variation: The illumination varies at different times of the day,
usually from dark to light and then back to dark, from dawn to dusk.
Weather changes: Scenes in the wild occur under various weather conditions,
such as clear, cloudy, rainy, foggy, thundery, overcast, and extremely sunny.
Figure 1.2 shows some examples where people are highly occluded (Fig. 1.2a), the
background is complex (Fig. 1.2b), the scale varies (Fig. 1.2c), the camera view changes (Fig. 1.2d),
the illumination changes (Fig. 1.2e), and the weather changes (Fig. 1.2f). These challenges
cannot all be solved in one model. In this thesis, we attempt to improve people counting
performance by overcoming occlusion and scale changes, although some other challenges
may be implicitly addressed by the studied model itself.
1.3 Objectives and Contributions
1.3.1 Objectives
The main goal of this thesis is to improve the performance of people counting from
images/videos by overcoming the occlusion issue in crowded scenes. To achieve this
goal, the specific objectives are as follows:
Conduct a survey of existing techniques for human counting, analyze their main
drawbacks, and then propose a suitable solution.
Study and develop techniques for detecting humans that can be used for people
counting and localization.
Improve the techniques to avoid missed detection in crowded scenes.
1.3.2 Contributions
The work of this thesis is within the context of a project granted by the Ministry of
Education and Training (MOET), with the project code CT2020.02.BKA.02. One of
the tasks of this project is to detect and count the number of students in a classroom,
and then create a density map of the students. This will aid in better management
of the students and improve the quality of teaching and learning. As a result, besides
validating the proposed method on benchmark datasets, we also took part in building a
new dataset in a classroom environment and tested our method on it. We summarize the main
contribution of our work as follows:
First, we propose a method that combines the detection results of both face
and head to improve the true positive rate of people counting (called MultiDetect).
Second, we deploy a tracking technique to handle fast-moving objects that may
cause motion blur effects and missed detections (called MultiDetect with Track).
Finally, we conduct extensive experiments to validate the MultiDetect improvement
on three benchmark datasets (Wider Face, Hollywood Heads, and Casablanca). We
also built a new dataset within the MOET project, participating in the collection
and annotation of the data, and we conduct extensive experiments to validate both
improvements, MultiDetect and MultiDetect with Track, on our dataset.
1.4 Thesis outline
The thesis is structured into 5 chapters:
1. Introduction: This chapter provides the definition of the people counting problem
and introduces its scientific and practical significance. Then, we describe some
of the main challenges that motivated the work of this thesis. Finally, we present
our objectives, contributions, and thesis outline.
2. Background and Related Works: This chapter conducts a survey on the deep
learning-based approach for human detection and tracking for the problem of
people counting and localization. We also describe some methods that roughly
estimate the people density without giving localization information. The analysis
on both approaches and our constraints motivate us to follow the detection and
tracking methods. We then present briefly fundamental knowledge about deep
models for human detection and tracking.
3. Proposed Method: This chapter introduces our proposed framework for peo-
ple detection and localization from videos. We describe each component of the
framework in detail and how to implement it in practice.
4. Experiments: This chapter presents the datasets, evaluation protocol, technical
setup, results, and discussions related to our experiments. In particular, we
describe the process of collecting and annotating our new dataset in a classroom
environment.
5. Conclusion: This chapter summarizes our work, highlights the contributions,
analyzes the limitations, and provides some ideas for future research directions.
Chapter 2
Related works
This chapter presents some basic knowledge as well as related works regarding the
research topic of this thesis. There are two approaches to the people counting problem
from still images: i) the detection-based approach, which detects individuals and can thus
give the number of people in the scene; ii) the density estimation-based approach, which
only gives a rough estimate of the number of people without location information. Besides, to improve
the detection result, some works deploy tracking techniques. We present these ap-
proaches in sections 2.1 and 2.2 respectively. We then describe the tracking techniques
in section 2.3. Finally, we conclude the chapter in section 2.4.
2.1 Detection based people counting
People counting can be carried out by detecting faces, heads, or bodies depending on
the context and the scene. The most common technique is detecting the human body,
but in cases where the human body is occluded or in a challenging posture, the head
and face can be alternative solutions.
2.1.1 Face detection based people counting
Face detection-based people counting aims to detect and track faces in images or real-
time video streams, then count the number of detected faces. Face detection algorithms
typically rely on deep learning models trained on large datasets to identify and locate
faces in images or videos accurately.
Tsong-Yi Chen et al. presented an automatic people-counting system based on face
recognition, in which a video camera counts people passing through a gate or
door [5]. First, they use image differencing to detect the rough edges of
moving people. Then, color features are utilized to locate people’s faces. Based on
the NCC (Normalized Color Coordinate) color space, the face is initially obtained by
detecting the skin tone area, and then the subject’s facial features are analyzed to
determine whether the subject is a real face or not. After face detection, a person is
tracked by following the recognized face, and this person is counted when the person's
face crosses the counting line.
Xi Zhao et al. presented a method of counting people based on face detection,
tracking, and trajectory classification [2]. They first performed face detection and
then face tracking by combining a new scale-invariant Kalman filter with a kernel-
based color histogram tracking algorithm. From each face trajectory, the angles between
neighboring points are extracted. Finally, to distinguish real face trajectories from
fake ones, the authors used the K-NN classification method based on the Earth
Mover's distance. The framework of this method is described in Fig. 2.1.
Figure 2.1: Framework for people counting based on face detection and tracking in
a video [2].
Guangyu Zhao et al. [6] developed a system capable of detecting and counting
people using a Kinect camera. The authors used depth information to reject false face
detections, and then 3D data association is used to link tracks with detection results.
Finally, they counted the people who enter the region of interest using validated
trajectories, as shown in Fig. 2.2.
Figure 2.2: Depth-assisted face detection and association for people counting [6].
Face detection nowadays can achieve very high accuracy. However, this problem
still faces challenges such as variations in lighting conditions, occlusions, and face
orientation. Besides, it requires the face to be visible in front of the camera. Without that
assumption, the performance of face detection-based people counting may drop drastically.
2.1.2 Head detection based people counting
Head detection can be a more flexible solution to deal with the constraint of detecting
only frontal faces. Bin Li et al. [7] proposed a people-counting method based on
head detection and tracking. The purpose of this proposal is to count the number
of people who move under an indoor overhead camera. This framework consists of
four parts: foreground extraction, head detection, head tracking, and crossing-line
judgment. Firstly, the proposed method utilizes a foreground extraction method to
obtain foreground regions of moving people, and some morphological operations are
employed to optimize the foreground regions. After that, it exploits an LBP (local
binary pattern) feature-based Adaboost classifier for head detection in the optimized
foreground regions. Once a head is detected, it is tracked by a local head tracking method
based on the Mean Shift algorithm. Finally, based on head tracking, the method uses
crossing-line judgment to determine whether the candidate head object will be counted
or not, as Fig.2.3.
Figure 2.3: System framework for a people counting method based on head detection
and tracking [7].
In [8], the authors proposed a deep model-based head detector that takes scale
variations into account. People counting in outdoor venues faces many challenges,
such as severe occlusions, few pixels per head, and significant variations in a
person's head size due to wide sports areas. This method is based on
the notion that the head is the most visible body part in sports venues where large
numbers of people gather. They generate scale-aware head proposals based on a scale
map to cope with the problem of different scales. Scale-aware proposals are then
fed to the Convolutional Neural Network (CNN), which provides a response matrix
containing the presence probabilities of people observed across scene scales. Finally,
they use non-maximal suppression to get accurate head positions. For the performance
evaluation, they carry out extensive experiments on two standard datasets: S-HOCK
and UCF-HDDC.
2.1.3 Hybrid detection based people counting
In various scenarios, a single head detector or face detector may not provide accurate
results. Therefore, some researchers proposed hybrid detection that combines detection
results from different human parts (body, head, face).
Hybrid detection-based people counting combines human body parts to improve
the efficiency of counting people in a crowd. The Double Anchor R-CNN network, shown
in Fig. 2.4 and proposed by Zhang et al. [9], combines the head and body of a person.
This network consists of four stages:
1. A dual-anchor zone recommendation network to generate head and body sugges-
tions in pairs.
2. A cross-recommendation module to generate high-quality training samples for
the R-CNN part.
3. A module to efficiently combine head and body features.
4. A generic NMS (Non-Maximum Suppression) algorithm for post-processing.
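The generic NMS in step 4 can be sketched as greedy IoU-based suppression (a simplified sketch, not necessarily the exact variant used in [9]; the corner-format boxes and the 0.5 threshold are illustrative assumptions):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]   # process boxes in descending score order
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # discard remaining boxes that overlap the kept box too much
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

Here the second box overlaps the first beyond the threshold and is suppressed, while the distant third box survives.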
Figure 2.4: Architecture of Double Anchor R-CNN [9].
Another approach also combines head and body using the JointDet architecture [10]. The JointDet network consists of four major components, as shown in Fig. 2.5: the RPN
network, the Head R-CNN, the Body R-CNN, and the RDM. The head-to-body ratio is then calculated to obtain whole-body proposals. The head and body proposals are then submitted to two parallel R-CNN branches to obtain interim results. These temporary results are further processed to get the final results, as follows:
Matching heads and bodies using the proposed strategy to output the matched body-head pairs, pair 1 to pair N;
Extracting the corresponding features of each pair for the RDM to discriminate their relation (i.e., whether they belong to the same person);
Using the learned relationship to reduce head false positives and recall suppressed human detections.
To evaluate the effectiveness of the proposed method, they conducted extensive experiments on the CrowdHuman, CityPersons, and Caltech-USA datasets. The results show that their method outperforms previous methods.
Figure 2.5: Architecture of JointDet [10].
Through our survey of methods that combine human body parts, we found that hybrid models improve accuracy significantly. These findings not only provided us with new ideas but also motivated us to develop more robust people-counting algorithms for crowded scenes.
2.2 Density estimation based people counting
Video surveillance systems are commonly deployed in very crowded scenes where it is impractical to detect each individual. As a consequence, density estimation is an approach to approximately count the number of people.
The authors in [11] conducted an estimate of people density in a crowded environment. They proposed a two-fold method: first, estimate the density of the crowd; second, count the people in it. As the density of the crowd increases, so does the congestion. To get around this, they use an improved adaptive K-GMM background subtraction method to extract the foreground accurately in real-time applications. By applying a boundary detection algorithm, they were able to estimate the size of the crowd. The number of people in a crowd was counted using the Canny edge detector, connected component labeling, and a centered bounding box method. The proposed work, a real-time video surveillance system, was evaluated on several datasets, including IBM, KTH, CAVIAR, PETS2009, and CROWD, used for both the training and testing phases.
The authors in [12] proposed a supervised learning framework for visual object
counting tasks, such as estimating the number of people in a surveillance video frame.
Their goal is to accurately estimate the number of people while avoiding the difficult task of detecting and locating individual objects. Instead, they proposed to estimate a density whose integral over any image region gives the object count in that region. Learning to infer such a density can be formulated as minimizing a regularized quadratic cost function. To this end, they introduced a new loss function, well suited to such learning and efficiently computable through a maximum subarray algorithm. The problem can then be cast as a convex quadratic program that is solvable with cutting-plane optimization. The proposed framework is flexible, as it can accept any domain-specific visual features.
Once trained, their system provides the number of objects and requires only a very
short amount of time for the feature extraction step. Therefore, this model becomes
a good candidate for applications involving real-time processing or processing huge
amounts of image data. Fig.2.6 illustrates an example of people density estimation.
Figure 2.6: Examples of people density estimation. Counting people in a surveillance
video frame. Close-ups are shown alongside the images. The bottom close-ups show
examples of the dotted annotations (crosses). This framework learns to estimate the
number of objects in the previously unseen images based on a set of training images of
the same kind augmented with dotted annotations [12].
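The core idea of [12], counting by integrating a density map, can be illustrated with a toy example (the Gaussian-blob density below is fabricated for illustration; in the real framework the density values are regressed from visual features):

```python
import numpy as np

def gaussian_blob(shape, center, sigma=2.0):
    """A small Gaussian whose values sum to 1, so that integrating it over
    any region containing it contributes one count."""
    y, x = np.mgrid[0:shape[0], 0:shape[1]]
    g = np.exp(-((x - center[1]) ** 2 + (y - center[0]) ** 2) / (2 * sigma ** 2))
    return g / g.sum()  # normalize so the blob integrates to 1

# Build a toy density map from three dot annotations (row, col).
density = np.zeros((40, 40))
for person in [(10, 10), (12, 14), (30, 25)]:
    density += gaussian_blob(density.shape, person)

total_count = density.sum()             # count over the whole frame (~3)
region_count = density[:20, :20].sum()  # count inside a sub-region (~2)
```

Counting then reduces to summing the density over the frame or any region of interest, with no per-person detection step.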
2.3 People tracking
Tracking is an efficient technique to improve the true positive rate when an object is missed by the detector at a given frame. In this section, we briefly present an overview of the
object tracking problem, then two typical tracking techniques (SORT and DeepSORT)
that are widely deployed in literature. We finally analyze some works using tracking
for people counting problems.
2.3.1 Overview of object tracking
Object tracking is a technique used to assign a unique ID to each object as it moves
temporally. The process starts when the object appears and ends when the object
leaves the scene for a certain time. The goal of object tracking is to accurately identify
objects of interest, estimate their trajectories in the video, and track them as they
move. The object tracking problem involves:
Object detection: The first step in real-time object tracking is to detect the
object of interest in each frame of the video or image stream. There are various
object detection techniques available, including feature-based methods such as
scale-invariant feature transform (SIFT), speeded-up robust features (SURF),
and histograms of oriented gradients (HOG), as well as deep learning-based object
detection algorithms such as YOLO (You Only Look Once), SSD (Single Shot
Detector), and Faster R-CNN (Region-based Convolutional Neural Network).
Object tracking: Once the object is detected, the next step is to track it over
time. Object tracking can be achieved using various techniques, including optical
flow, mean-shift, particle filters, and Kalman filters. These techniques estimate
the object’s motion and predict its location in subsequent frames.
Data association: In scenarios where there are multiple objects in the video
or image stream, it is essential to associate each object’s location with its corre-
sponding identity. Data association techniques, such as the Hungarian algorithm,
are used to match the detected objects with their previous locations to maintain
their identities over time.
Object re-detection: In some scenarios, the object of interest may disappear from
the video or image stream for a short duration. Object re-detection techniques,
such as template matching or appearance modeling, can be used to re-detect the
object when it reappears.
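The four components above can be organized into a minimal per-frame tracking loop. The sketch below is deliberately simple: detections and tracks are reduced to 2-D centroids, and the association step is a greedy nearest-neighbor stand-in for a full matcher such as the Hungarian algorithm.

```python
# Minimal per-frame tracking loop: detect -> associate -> update -> manage.

def associate(tracks, detections, max_dist=5.0):
    """Greedily match each track to its nearest unclaimed detection."""
    matches, used = {}, set()
    for tid, pos in tracks.items():
        best, best_d = None, max_dist
        for j, det in enumerate(detections):
            d = ((pos[0] - det[0]) ** 2 + (pos[1] - det[1]) ** 2) ** 0.5
            if j not in used and d < best_d:
                best, best_d = j, d
        if best is not None:
            matches[tid] = best
            used.add(best)
    return matches

def step(tracks, detections, next_id):
    """One frame of tracking: associate, update matched, create new tracks."""
    matches = associate(tracks, detections)
    for tid, j in matches.items():
        tracks[tid] = detections[j]          # update with measurement
    for j, det in enumerate(detections):     # unmatched detections -> new IDs
        if j not in matches.values():
            tracks[next_id] = det
            next_id += 1
    return tracks, next_id

tracks, next_id = {}, 0
for frame_dets in [[(0, 0)], [(1, 0), (10, 10)], [(2, 1), (11, 10)]]:
    tracks, next_id = step(tracks, frame_dets, next_id)
```

After three frames, the first person keeps ID 0 across frames and the second person, appearing later, receives ID 1, illustrating how IDs persist over time.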
2.3.2 Multiple Object Tracking
Simple object tracking assumes there is only one object in the scene. Tracking becomes
harder when there are many objects. The multiple object tracking method aims to track
all objects appearing in the frame by detecting and assigning identifiers to each object,
as shown in Fig. 2.7. In addition, the ID assigned to an object needs to be consistent across frames. Multiple object tracking requires handling:
Accurate object detection: This is a critical task, especially for detection-based
tracking, to ensure the presence of all objects in the scene.
Occluded objects: Objects are partially or completely obscured. When an ID is
assigned to an object, the ID should be consistent throughout the video. However,
when an object is obscured, relying solely on object detection is not enough to
solve this problem.
Object absence: An object may go out of the frame and then reappear. Similar to the previous issue, this concerns ID switches. It is necessary to solve the problem of object re-identification, whether for occluded or disappeared objects, to reduce the number of ID switches to the lowest possible level.
Overlapping trajectories: Objects with overlapping trajectories can also lead to wrong assignment of IDs, another problem to deal with when working with multiple object tracking.
Figure 2.7 illustrates an example of multiple object tracking. In the first row, people are first detected and bounded by yellow boxes. The second row presents people tracked over time, each person identified by a color. The last row shows a case where one person is detected in the first frame (red bounding box) but missed in the next frames due to occlusion, yet is still kept by the tracking technique.
2.3.3 Tracking techniques
In the literature, a number of tracking techniques have been proposed for different tasks such as human tracking, robot tracking, and so on. In this section, we review three conventional techniques: the Kalman filter, SORT, and DeepSORT, the latter two being successively improved versions.
Figure 2.7: Multiple Object Tracking. (a) shows all the detection boxes with their
scores. (b) shows the tracklets obtained by previous methods, which associate detection boxes whose scores are higher than a threshold, i.e., 0.5. The same box color represents
the same identity. (c) shows the tracklets obtained by the proposed method in the
paper. The dashed boxes represent the predicted box of the previous tracklets using
Kalman Filter. The two low score detection boxes are correctly matched to the previous
tracklets based on the large IoU [13].
2.3.3.1 Kalman filter
The Kalman filter [14] was proposed by R. E. Kalman in 1960. The Kalman filter predicts the state of an object using previous information. The Kalman filter equations are categorized into two groups: prediction (time update) and correction (measurement update). The measurement update provides a feedback value that, combined with the prior state estimate, gives a posterior state estimate.
In order to use the Kalman filter to estimate the internal state of a process given only a sequence of noisy observations, one must model the process in accordance with the following framework. This means specifying, for each time-step $k$, the following matrices:

$F_k$: the state-transition model;

$H_k$: the observation model;

$Q_k$: the covariance of the process noise;

$R_k$: the covariance of the observation noise;

and sometimes $B_k$: the control-input model; if $B_k$ is included, then there is also $u_k$: the control vector, representing the controlling input into the control-input model.
The Kalman filter model assumes that the true state at time $k$ evolves from the state at $(k-1)$ according to Eq. (2.1):

$x_k = F_k x_{k-1} + B_k u_k + w_k$ (2.1)
where:

$F_k$ is the state transition model, which is applied to the previous state $x_{k-1}$;

$B_k$ is the control-input model, which is applied to the control vector $u_k$;

$w_k$ is the process noise, which is assumed to be drawn from a zero-mean multivariate normal distribution with covariance $Q_k$.
At time $k$ an observation (or measurement) $z_k$ of the true state $x_k$ is made according to Eq. (2.2):

$z_k = H_k x_k + v_k$ (2.2)

where:

$H_k$ is the observation model, which maps the true state space into the observed space;

$v_k$ is the observation noise, which is assumed to be zero-mean Gaussian white noise with covariance $R_k$.
The processing of the Kalman filter can then be divided into two main steps:

Step 1: Predict

Predicted (a priori) state estimate:

$\hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1} + B_k u_k$ (2.3)

Predicted (a priori) estimate covariance:

$P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_k$ (2.4)
Step 2: Update

Measurement pre-fit residual:

$\tilde{y}_k = z_k - H_k \hat{x}_{k|k-1}$ (2.5)

Innovation covariance:

$S_k = H_k P_{k|k-1} H_k^T + R_k$ (2.6)

Optimal Kalman gain:

$K_k = P_{k|k-1} H_k^T S_k^{-1}$ (2.7)

Updated (a posteriori) state estimate:

$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k$ (2.8)

Updated (a posteriori) estimate covariance:

$P_{k|k} = (I - K_k H_k) P_{k|k-1}$ (2.9)

Measurement post-fit residual:

$\tilde{y}_{k|k} = z_k - H_k \hat{x}_{k|k}$ (2.10)
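Equations (2.3) to (2.9) translate almost line-for-line into NumPy. The sketch below is generic; the 1-D constant-velocity model at the bottom is an illustrative assumption, not a specific tracker configuration.

```python
import numpy as np

def kf_predict(x, P, F, Q, B=None, u=None):
    """Eqs. (2.3)-(2.4): project state and covariance forward in time."""
    x = F @ x + (B @ u if B is not None else 0)
    P = F @ P @ F.T + Q
    return x, P

def kf_update(x, P, z, H, R):
    """Eqs. (2.5)-(2.9): correct the prediction with a measurement."""
    y = z - H @ x                      # pre-fit residual (2.5)
    S = H @ P @ H.T + R                # innovation covariance (2.6)
    K = P @ H.T @ np.linalg.inv(S)     # optimal Kalman gain (2.7)
    x = x + K @ y                      # posterior state estimate (2.8)
    P = (np.eye(len(x)) - K @ H) @ P   # posterior covariance (2.9)
    return x, P

# 1-D constant-velocity model: state [position, velocity], observe position.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])
x, P = np.zeros(2), np.eye(2)
for z in [1.0, 2.0, 3.0]:              # object moving one unit per step
    x, P = kf_predict(x, P, F, Q)
    x, P = kf_update(x, P, np.array([z]), H, R)
```

After three noisy position measurements, the estimated position approaches 3 and the estimated velocity approaches 1, as expected for the simulated motion.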
2.3.3.2 SORT
SORT [15] is an acronym for Simple Online and Realtime Tracking, an algorithm belonging to the tracking-by-detection (or detection-based tracking) family. In tracking by detection, a common scheme is to first obtain detection results and then use them to track the objects. The task is then to associate the bounding boxes obtained in each frame and assign an ID to each object. The processing steps for each new frame are as follows:
Detection: This step aims to precisely detect and locate the objects in the frame. Any object detector can be applied; in the original SORT paper, Faster R-CNN was utilized.
Prediction: This step utilizes the Kalman filter to predict the new positions of objects at frame $t$ based on the previous frames.
Association: In the case of multiple object tracking, an association algorithm is needed to associate a target with a detected object. In the SORT algorithm, the Hungarian algorithm is deployed for this purpose.
Hungarian Algorithm
The Hungarian algorithm [16] was published in 1955 as a solution to the assignment problem. Let $n$ denote the number of detections ($i = 1, 2, \ldots, n$) and $m$ the number of predicted tracks ($j = 1, 2, \ldots, m$), as shown in Fig. 2.8. The association of a detection $i$ with a track $j$ is based on a cost function, the distance between $i$ and $j$ in feature space. Details of the Hungarian algorithm can be found in the original paper [16]; in the following, we just review some concepts and ideas. The Hungarian algorithm associates each detection with its corresponding track so that the total cost over all associations is minimal (Eq. (2.11)).
Figure 2.8: Hungarian Algorithm [16].
$z = \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} x_{ij} \rightarrow \min$ (2.11)
where

$\sum_{i=1}^{n} x_{ij} = 1, \quad j = 1, 2, \ldots, m$

$\sum_{j=1}^{m} x_{ij} = 1, \quad i = 1, 2, \ldots, n$

$x_{ij} = 0$ or $1$ (2.12)
The set of $x_{ij}$ values that minimizes the total cost function $z$ is considered a solution of the association. We rely on the following two theorems to find $x_{ij}$:

Theorem 1: Suppose the cost matrix of the assignment problem is non-negative and has at least $n$ zero elements. Furthermore, if these $n$ zeros lie in $n$ different rows and $n$ different columns, then the solution assigning detection $i$ to the track corresponding to the zero in row $i$ is an optimal solution of the problem.

Theorem 2: Let $C = [c_{ij}]$ be the cost matrix of the assignment problem ($n$ detections, $m$ tracks) and $X^* = [x^*_{ij}]$ an optimal solution to this problem. Suppose $C'$ is a matrix obtained from $C$ by adding a number $\alpha$ (positive or negative) to each element in row $r$ of $C$. Then $X^*$ is also a solution to the assignment problem with the cost matrix $C'$.
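For the small matrices that arise in tracking, the objective of Eq. (2.11) can be checked by brute force over all one-to-one assignments. This is only an illustrative sketch with a made-up cost matrix; the Hungarian algorithm reaches the same optimum in polynomial time, and in practice a library routine such as SciPy's `linear_sum_assignment` is typically used.

```python
from itertools import permutations

# cost[i][j]: association cost between detection i and track j (made up).
cost = [
    [0.2, 0.9, 0.8],
    [0.7, 0.1, 0.9],
    [0.8, 0.9, 0.3],
]

def optimal_assignment(cost):
    """Minimize Eq. (2.11) by brute force over all one-to-one assignments.
    The Hungarian algorithm finds the same optimum in O(n^3) time."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):   # perm[i] = track for detection i
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

assignment, z = optimal_assignment(cost)
```

Here each detection is matched to the track with the lowest cost in its row, giving the total cost $z = 0.6$.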
Main steps of the SORT algorithm. The processing steps of SORT are shown in Fig. 2.9:
Step 1: Detection: The first step is to detect objects in each frame of a video
using a computer vision algorithm such as a neural network-based detector.
Step 2: Association: The next step is to associate the detected objects with
previously tracked objects from previous frames. This is done by comparing the
features of the detected objects with those of the existing tracked objects and
assigning a similarity score.
Step 3: Prediction: Once the association is made, the algorithm predicts the
position of the objects in the next frame using a Kalman filter or another motion
model.
Step 4: Update: In this step, the tracked objects are updated with the new
information from the current frame, such as the position and size of the objects.
Step 5: Track management: The final step involves managing the tracked objects,
such as removing objects that are no longer in the frame or creating new tracks
for newly detected objects.
Note that there are three types of output of the Hungarian algorithm: 1) a detection corresponds to a target (matched track); this association is then used to update the Kalman filter; 2) unmatched tracks: no detection is found to match the track, and the track may be deleted depending on its lifetime; 3) newly detected objects that are not matched with any targets are used to create new tracks.
Figure 2.9: The tracking process of the SORT algorithm.
2.3.3.3 DeepSORT
In the original version of SORT, the cost function is defined based on the IoU distance and does not take the appearance similarity of detection and target into account. DeepSORT was developed by Nicolai Wojke and Alex Bewley [17] to address the resulting high number of ID switches. The solution proposed by DeepSORT uses deep learning to extract features of objects and thereby increase the accuracy of the data association process. In addition, a linking strategy known as 'Matching Cascade' was developed to more effectively re-link objects that had previously disappeared.
DeepSORT is an improvement over the SORT (Simple Online and Realtime Track-
ing) algorithm in multiple ways:
Association metric: DeepSORT uses the Mahalanobis distance metric to associate
detected objects with existing tracks, while SORT uses the Euclidean distance
metric. The Mahalanobis distance takes into account the covariance matrix of
the data, which allows better handling data with varying scales and correlation
between dimensions. This results in a more accurate association of objects with
existing tracks, even when there are occlusions or other objects in the scene.
Feature embedding: DeepSORT uses a deep neural network to embed object
features into a high-dimensional space, while SORT uses hand-crafted features
such as color histograms. Deep learning-based embeddings are more powerful
and expressive, allowing for better discrimination between objects and reducing
the risk of track drift.
Track management: DeepSORT employs a track management strategy that allows better handling of occlusions, missed detections, and track fragmentation. Specifically, it uses a Kalman filter to estimate the position and velocity of the object, and a gating mechanism to filter out detections that are unlikely to belong to the track. It also uses a track initiation process to start new tracks when no existing track can be associated with a new detection.
Overall, DeepSORT’s improvements over SORT result in more accurate and robust
tracking, especially in challenging scenarios where objects are partially occluded or
move quickly.
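The Mahalanobis gating used during DeepSORT's association can be sketched as follows (generic NumPy; the 2-D measurement, the covariance, and the chi-square gate for 2 degrees of freedom are illustrative, since DeepSORT actually gates a 4-D box measurement):

```python
import numpy as np

def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance between a detection z and a track's
    predicted measurement distribution N(mean, cov)."""
    d = z - mean
    return float(d @ np.linalg.inv(cov) @ d)

# Track predicts a measurement at (5, 5); uncertainty larger along x than y.
mean = np.array([5.0, 5.0])
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
gate = 5.991  # 95% chi-square quantile for 2 degrees of freedom

det_a = np.array([8.0, 5.0])   # 3 units off along the uncertain x-axis
det_b = np.array([5.0, 8.0])   # 3 units off along the certain y-axis
d_a = mahalanobis_sq(det_a, mean, cov)   # 9/4 = 2.25 -> inside the gate
d_b = mahalanobis_sq(det_b, mean, cov)   # 9/1 = 9.0  -> outside the gate
```

The same Euclidean offset is accepted along the uncertain axis but rejected along the certain one, which is exactly why Mahalanobis gating handles anisotropic track uncertainty better than a plain distance threshold.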
2.3.4 Tracking-based people counting
Tracking-based people counting is a method of counting people by using computer
vision techniques to track individuals as they move through a space. There are several
different tracking-based people-counting methods, including:
1. Single-camera tracking [18]: This method uses a single camera to track people
as they move through the space. The camera captures images or video, and the
software analyzes the data to identify individuals and track their movements.
2. Multiple-camera tracking [19]: This method uses multiple cameras placed through-
out the space to capture images or video from different angles. The software com-
bines the data from each camera to track people as they move between different
areas.
3. Depth-based tracking [20]: This method uses cameras that can capture depth
information, such as Microsoft’s Kinect camera, to track people as they move
through space. The software analyzes the depth data to identify individuals and
track their movements.
The research in [21] aims to develop an accurate and efficient system capable of error-free counting and tracking in public places. The main goal is a system that works well in different directions, at different densities, and on different platforms. The authors proposed a new and precise approach that includes pre-processing, object detection, person verification, particle stream analysis, feature extraction, self-organizing map (SOM)-based clustering, people counting, and people tracking, as shown in Fig. 2.10. Initially, filters are applied to preprocess images and detect objects. Next, random particles are distributed, and features are extracted. Subsequently, particle flows are clustered using a self-organizing map, and people counting and tracking are performed based on motion trajectories. Tests on the PETS-2009 and TUD-Pedestrian datasets achieved high results.
Figure 2.10: Architecture of a people counting and tracking system [21].
The authors in [22] presented a novel multi-person tracking system for crowd count-
ing and normal or abnormal event detection in indoor and outdoor surveillance envi-
ronments. The proposed system consists of four modules, as shown in Fig.2.11: peo-
ple detection, head-to-torso template extraction, tracking, and crowd cluster analysis.
Firstly, the system extracts human silhouettes using an inverse transform as well as a
median filter, reducing the cost of computing and handling various complex monitor-
ing situations. Secondly, people are detected by their heads and torsos due to their
being less varied and barely occluded. Thirdly, each person is tracked through consec-
utive frames using the Kalman filter technique with Jaccard similarity and normalized
cross-correlation. Finally, the template matching is used for crowd counting with cue
localization and clustering via Gaussian mapping for normal or abnormal event detection. The experimental results on two challenging video surveillance datasets, PETS2009 and UMN crowd analysis, demonstrate that the proposed system
provides 88.7% and 95.5% in terms of counting accuracy and detection rate, respec-
tively.
Figure 2.11: Flow architecture of the proposed smart surveillance system [22].
2.4 Conclusion of the chapter
This chapter presented our study of people counting methods based on detection and tracking, and on density estimation. Detection and tracking-based people counting techniques are suitable for moderately crowded scenes, while density estimation is more suitable for highly crowded scenes. In this work, we follow the first approach, which detects and tracks humans to improve accuracy when occlusion appears. We will describe our proposed method in chapter 3.
Chapter 3
Proposed method for people
counting
In this chapter, we present the framework of our proposed method for people counting. We then describe in detail each component of the framework: detection of heads and faces using a fine-tuned Yolo-v5 and a pre-trained RetinaFace, combination of head and face detection results, and head tracking using the SORT algorithm.
3.1 The proposed people counting framework
We propose a method for people counting based on a combination of two detectors of
head and face and a tracker. Fig. 3.1 illustrates the main components of our method.
It is composed of:
Head detection: As mentioned previously, the head is a rigid object that can be easily observed in a scene, so detecting heads is the most convenient way to count people. In this work, we rely on Yolo to detect heads and fine-tune the network on our dataset to make it more accurate in our experimental setting.
Face detection: Face detection algorithms have achieved mature results. When a head is miss-detected, the face can be used to recover it and improve the true positive rate. In our work, we utilized the pre-trained RetinaFace model because it has been trained on a very large dataset; in addition, we do not have an annotated face dataset for fine-tuning.
Combination of face and head detection: Given the two sets of objects (faces and heads) detected separately, we must align each person's face to his/her head. In this step, we deploy the Hungarian algorithm with an IoU-based cost function to align a face with a head. Unaligned faces or heads are still counted, as a head with a missed face or a face with a missed head.
Person tracking: To avoid missed detections in individual frames, we deploy a tracking technique (SORT) which keeps track of all detected heads over time.
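The combination step can be sketched as follows: compute the IoU between every face and head box and pair them greedily (a simplified stand-in for the Hungarian matching our framework uses; the corner box format and the toy boxes are assumptions for illustration):

```python
from itertools import product

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter else 0.0

def pair_faces_heads(faces, heads):
    """Greedily pair each face with the overlapping head of highest IoU.
    Unpaired faces/heads are returned separately and still counted."""
    candidates = sorted(
        ((iou(f, h), i, j)
         for (i, f), (j, h) in product(enumerate(faces), enumerate(heads))),
        reverse=True)
    pairs, used_f, used_h = [], set(), set()
    for s, i, j in candidates:
        if s > 0 and i not in used_f and j not in used_h:
            pairs.append((i, j))
            used_f.add(i)
            used_h.add(j)
    lone_faces = [i for i in range(len(faces)) if i not in used_f]
    lone_heads = [j for j in range(len(heads)) if j not in used_h]
    return pairs, lone_faces, lone_heads

faces = [(2, 2, 6, 6), (20, 20, 24, 24)]
heads = [(1, 1, 7, 7), (40, 40, 46, 46)]
pairs, lone_faces, lone_heads = pair_faces_heads(faces, heads)
# people counted = matched pairs + faces with missed head + heads with missed face
count = len(pairs) + len(lone_faces) + len(lone_heads)
```

In this toy scene one face sits inside one head box, while the remaining face and head do not overlap anything, so three people are counted in total.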
In the following, we will describe in detail each component of our framework.
3.2 Yolo-based head detection
3.2.1 Yolo revisit
Yolo is a CNN network architecture for object detection, recognition, and classification. Whereas classification is only capable of assigning a label to an image, Yolo solves the object detection problem: it can not only detect many objects with many different labels but also determine their specific locations in the same image using bounding boxes. Yolo means "You Only Look Once": thanks to its recognition speed, Yolo is considered a fast model capable of real-time recognition. The Yolo architecture is built from convolutional layers to extract features and fully connected layers to predict object labels and positions. The input is an image, and the model predicts the position, size, and label of the bounding boxes.
Yolo has evolved over many versions. The first version was released in May 2016 by Joseph Redmon in the article "You Only Look Once: Unified, Real-Time Object Detection" [3], a huge step forward for object detection. In December 2016, Joseph Redmon and colleagues published another version of Yolo with the
Figure 3.1: The proposed framework for people counting by pairing head and face
detection and tracking.
paper "Yolo9000: Better, Faster, Stronger", called Yolo9000 [23]. In April 2018, another version of Yolo was released, called Yolov3, with the article "Yolov3: An Incremental Improvement" [24]. Exactly two years later, Alexey Bochkovskiy introduced Yolov4 with the article "Yolov4: Optimal Speed and Accuracy of Object Detection" [4].
Basically, instances of Yolo consist of two parts: the base network, made of convolutional layers for feature extraction, and extra layers for detecting objects on the feature maps produced by the convolutional layers.
This architecture is inspired by the GoogleNet architecture for image classification. The network has 24 convolutional layers followed by 2 fully connected layers. Instead of using the inception module of GoogleNet, Yolo uses 1x1 convolutional layers to reduce the depth of the feature maps. The last layer of the network simultaneously predicts the detection probability and the coordinates of the corresponding bounding boxes of the detected objects. These values are all normalized to the range [0, 1]. The last layer uses a linear activation function, while the other layers use a leaky rectified linear activation function (Eq. (3.1)):

$\varphi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$ (3.1)
The output of the Yolo network has the following form:

$y^T = [\underbrace{p_0, t_x, t_y, t_w, t_h}_{\text{bounding box}}, \underbrace{p_1, p_2, \ldots, p_C}_{\text{scores of } C \text{ classes}}]$ (3.2)
where:

$p_0$ is the predicted probability that an object appears in the bounding box;

$t_x, t_y, t_w, t_h$ define the bounding box: $t_x, t_y$ are the coordinates of the center, and $t_w, t_h$ are the width and height of the bounding box;

$p_1, p_2, \ldots, p_C$ is the predictive probability distribution vector of the classes.
Thus, the output size of the Yolo network per box is the number of classes + 5. Depending on the number of anchors used during training, the total output size of the network can change. Fig. 3.2 illustrates the output of the Yolo network.
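Decoding one such output vector can be sketched as follows (a generic decode with three classes; the grid-cell offsets and anchor scaling of the full Yolo decoding are omitted, and the numbers are illustrative):

```python
# One prediction vector: [p0, tx, ty, tw, th, p1 ... pC]  (C = 3 classes here)
pred = [0.9, 0.5, 0.4, 0.2, 0.3, 0.1, 0.7, 0.2]

objectness = pred[0]
cx, cy, w, h = pred[1:5]          # normalized box parameters
class_scores = pred[5:]           # C class probabilities
best_class = max(range(len(class_scores)), key=lambda c: class_scores[c])
confidence = objectness * class_scores[best_class]

# Convert the center/size box to corner coordinates in [0, 1]
x1, y1 = cx - w / 2, cy - h / 2
x2, y2 = cx + w / 2, cy + h / 2
```

The final detection confidence is the product of the objectness score and the best class score, and the normalized center/size parameters are converted to corner coordinates for drawing or NMS.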
Figure 3.2: Output of Yolo network[3].
3.2.2 Yolov5
In this thesis, for head detection, among the different detectors presented, we utilize Yolov5 because it has the following advantages:
High accuracy on the COCO and Pascal VOC datasets.
Speed: Yolov5 is significantly faster than its predecessor Yolov4.
Customization: Yolov5 is highly customizable and can be easily adapted to different object detection tasks and datasets; it supports transfer learning.
Figure 3.3: Yolov5 architecture[4].
The improvements in Fig. 3.3 were originally called Yolov4, but due to the release of Yolov4 within the Darknet framework, the model was renamed Yolov5 to avoid version conflicts. There was quite a bit of controversy around the naming of Yolov5; its biggest contribution was translating the Darknet research framework into the PyTorch framework. The Darknet framework is written primarily in C and provides granular control over the operations encoded in the network. Yolo version 5 comes in five different sizes:
N for an extra small (nano) model;
S for a small model;
M for a medium model;
L for a large model;
X for an extra large model.
Yolo version 5 (Yolov5) improves on the previous versions. It has made significant improvements that increase accuracy compared to Yolo version 3 (Yolov3) [24] without affecting speed. Yolov5 is a single-stage detector, so it has three important parts like any other single-stage detector: the model backbone, the model neck, and the model head.
CSPDarknet53 Backbone: In the previous version, Yolov3 used Darknet-53 as the backbone. Darknet-53 combines the Darknet-19 backbone used in Yolov2 with residual networks. In this version, Yolov5 improves the Darknet-53 model by replacing regular ResNet blocks with CSPResNet blocks. This new structure helps increase the learning capacity of the CNN while reducing the computational volume and memory cost. More specifically, CSPNet can be easily applied over ResNet, ResNeXt, and DenseNet; applying CSPNet to these networks reduces the amount of computation by 10% to 20% while outperforming them in accuracy on the image classification problem.
Neck: SPP, PAN
Spatial Pyramid Pooling (SPP): In [25], the authors added an SPP block to Yolov4 to optimize both global features and local region features of many sizes, increasing the number and size of the receptive fields, as shown in Fig. 3.4.
Path Aggregation Network (PAN): In Yolov3, the authors used the FPN (Feature Pyramid Network) to synthesize global features at different convolution layers. This is done differently in Yolov4, which uses an enhanced version of PAN to aggregate information from all layers into a single output, as shown in Fig. 3.5.
Model Head: the model head performs the final detection step. It applies anchor boxes to the features and generates the final output vectors with class probabilities, confidence scores, and bounding box predictions. For each batch of training data, Yolov5 passes the data through the dataloader, which
Figure 3.4: Spatial Pyramid Pooling [4].
Figure 3.5: Path Aggregation Network [4].
augments the data online. The data loader performs three types of augmentation: scaling, color space adjustment, and mosaic augmentation. The most novel technique in this version is mosaic augmentation, which combines four images into one in random proportions. Mosaic augmentation originated in the Yolov3 PyTorch repository and is now part of Yolov5. It is particularly useful for the COCO benchmark, as it helps the model learn to address the object size problem: small objects are not detected as accurately as larger objects.
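Mosaic augmentation can be sketched in plain NumPy: four images are tiled around a randomly chosen center point (a minimal single-channel version; Yolov5's implementation also remaps the box labels and applies further jitter):

```python
import numpy as np

def mosaic(images, out_size=8, seed=0):
    """Combine four equally sized square images into one mosaic around a
    randomly chosen center point (cx, cy)."""
    rng = np.random.default_rng(seed)
    cx = rng.integers(out_size // 4, 3 * out_size // 4)
    cy = rng.integers(out_size // 4, 3 * out_size // 4)
    out = np.zeros((out_size, out_size), dtype=images[0].dtype)
    # top-left, top-right, bottom-left, bottom-right tiles
    out[:cy, :cx] = images[0][-cy:, -cx:]
    out[:cy, cx:] = images[1][-cy:, : out_size - cx]
    out[cy:, :cx] = images[2][: out_size - cy, -cx:]
    out[cy:, cx:] = images[3][: out_size - cy, : out_size - cx]
    return out

# Four constant-valued "images" make the tile layout easy to inspect.
tiles = [np.full((8, 8), v, dtype=np.uint8) for v in (1, 2, 3, 4)]
m = mosaic(tiles)
```

Because the center point is sampled away from the borders, every output contains a crop of all four source images, in varying proportions from run to run.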
Automatic learning of bounding box anchors: In Yolov3 PyTorch, Glenn Jocher introduced the idea of learning anchors from the distribution of bounding boxes in a custom dataset using K-means and genetic learning algorithms, as illustrated in Fig. 3.6. This is important for custom data, because the distribution of box sizes and locations can differ significantly from the anchor boxes previously learned on the COCO dataset.
Figure 3.6: Automatic learning of bounding box anchors [4]
Activation Function: Yolov5 uses the SiLU and Sigmoid activation functions. SiLU stands for Sigmoid Linear Unit and is also called the Swish activation function; it is used with the convolution operations in the hidden layers, while the Sigmoid activation function is used with the convolution operation in the output layer.
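As a plain-Python illustration (not the actual Yolov5 source, which implements these as PyTorch modules), the two activations can be written as:

```python
import math

def sigmoid(x):
    """Sigmoid activation, used with the convolution in the output layer."""
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """SiLU (Swish) activation, x * sigmoid(x), used in the hidden layers."""
    return x * sigmoid(x)
```

Unlike ReLU, SiLU is smooth and non-monotonic, which is one reason it is preferred in the hidden layers.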
3.2.3 Implementation of Yolov5 for head detection
In our work, we fine-tune Yolov5 for head detection. The main steps are as follows.
Installation
Figure 3.7: Activation functions used in Yolov5. (a) SiLU function. (b) Sigmoid
function [4]
1. Clone the repository: git clone https://github.com/ultralytics/yolov5.git
2. Install Python >= 3.7.0 and PyTorch >= 1.7.
3. Install the required libraries with the command: pip install -r requirements.txt
Dataset Preparation
1. Create dataset.yaml as Fig.3.8.
2. Organize directories
3. Select model.
4. Train model.
Model training and testing
To train the model, the command below is used:
$ python train.py --img 640 --batch 16 --epochs 100 --data coco128.yaml --weights yolov5l.pt
To evaluate the model, we use the command below:
$ python val.py --img 640 --batch 16 --data coco128.yaml --weights yolov5l.pt
3.3 RetinaFace based face detection
In this thesis, among the different detectors presented, we utilize RetinaFace for face detection because it has the following advantages:
Figure 3.8: Example for creating dataset.yaml.
High accuracy: RetinaFace achieves high accuracy on several benchmarks: WIDER FACE, AFW, PASCAL FACE, and FDDB.
Robustness: RetinaFace is designed to be robust to occlusions and variations in lighting.
Speed: RetinaFace is fast and can process high-resolution images in real time on modern GPUs.
Multi-tasking: RetinaFace can detect faces and facial landmarks and estimate face poses in a single pass.
Open source: RetinaFace is an open-source framework, making it accessible to researchers and developers.
3.3.1 RetinaFace architecture
RetinaFace is a detection network that uses multi-task learning techniques [26]. RetinaFace performs three different face localisation tasks together, namely face detection, 2D face alignment, and 3D face reconstruction, in a single-shot framework.
3D Face Reconstruction To create a 3D face from the 2D image, a predefined triangular face mesh with N vertices is used. The vertices share the same semantic meaning across different faces, and with the fixed triangular topology, each face pixel can be indexed by its barycentric coordinates and triangle index, establishing a pixel-wise correspondence with the 3D face.
For regressing the 3D vertices onto the 2D image plane, two loss functions are used. The first is the vertex loss, shown in Eq. (3.3):

L_vert = (1/N) Σ_{i=1}^{N} ‖ V_i(x, y, z) − V*_i(x, y, z) ‖₁    (3.3)

where N is the total number of vertices, V_i is the i-th predicted vertex, and V*_i is the corresponding ground-truth vertex.
The second is the edge length loss, which exploits the triangular topology, shown in Eq. (3.4):

L_edge = (1/3M) Σ_{i=1}^{M} ‖ E_i − E*_i ‖₁    (3.4)

where M is the number of triangles, E_i is the predicted edge length, and E*_i is the ground-truth edge length.
The total loss for regressing the 3D points then becomes Eq. (3.5):

L_mesh = L_vert + λ_0 L_edge    (3.5)
Multi-Level Face Localisation The complete loss function for an anchor i is given in Eq. (3.6):

L = L_cls(p_i, p*_i) + λ_1 p*_i L_box(t_i, t*_i) + λ_2 p*_i L_pts(l_i, l*_i) + λ_3 p*_i L_mesh(v_i, v*_i)    (3.6)
The loss function has four parts:
Softmax loss for binary classification (face / not face), where p_i is the predicted probability that anchor i is a face and p*_i is the ground truth.
Regression loss of the bounding box.
Regression loss of the five facial landmarks.
Regression loss of the 3D points, as discussed above.
All the coordinates are normalised as shown in Eq. (3.7):

(x*_j − x^a_center)/s^a,   (y*_j − y^a_center)/s^a,   (z*_j − z*_nose-tip)/s^a    (3.7)

where x*_j and y*_j are the ground-truth coordinates of the face box corners and the five facial landmarks, and (x*_j, y*_j, z*_j) are the 3D ground-truth vertices of the j-th face in image space. All 3D ground-truth vertices are translated such that the z coordinate of the nose tip is zero. x^a_center and y^a_center are the center coordinates of the face bounding box, and s^a is its scale. The width and height of the box are also normalised as log(w*/s^a) and log(h*/s^a), where w* and h* are the ground-truth dimensions of the face box.
Figure 3.9: An overview of the single-stage dense face localisation approach. Reti-
naFace is designed based on the feature pyramids with independent context modules.
Following the context modules, we calculate a multi-task loss for each anchor [26].
Single-shot Multi-level Face Localisation The model consists of three main components, as shown in Fig. 3.9:
Feature Pyramid Network: it takes the input image and outputs five feature maps at different scales. The first four feature maps are computed using a ResNet pre-trained on the ImageNet-11k dataset. The top-most feature map is obtained by a 3×3 convolution with stride 2 on C5.
Context Head Module: to strengthen the context modelling capacity, a deformable convolution network (DCN) is used in this module over the feature maps, instead of the normal 3×3 convolution.
Cascade Multi-Task Loss: to improve face localisation, cascade regression is used along with the multi-task loss described above. The first context module predicts bounding boxes using the regular anchors, and subsequent modules predict more accurate bounding boxes using the regressed anchors.
3.3.2 Implementation of RetinaFace for face detection
We utilized the pre-trained RetinaFace model for face detection because it has been trained on the large WIDER FACE dataset. The dataset consists of 32,203 images and 393,703 face bounding boxes with a high degree of variability in scale, pose, expression, occlusion, and illumination. The main steps for using RetinaFace are as follows.
Installation
1. Clone the Pytorch_Retinaface repository.
2. PyTorch 1.1.0+ and torchvision 0.3.0+ are required.
3. The code is based on Python 3.
Dataset
We organize the dataset directory as follows:
Figure 3.10: Dataset organization for RetinaFace training.
Testing
Testing on the WIDER FACE validation set is run with the command below:
Figure 3.11: Example of RetinaFace testing on Wider Face dataset.
3.4 Combination of head and face detection
To reduce the missed detection rate of both head and face detectors, we deploy both
detectors and fuse the detection results.
3.4.1 Linear sum assignment problem
The resulting sets F_head and F_face can be directly used to derive the total number of people. We formulate face–head pairing as a linear sum assignment problem, which involves finding unique, non-repeating pairs of head and face boxes that belong to one person; it can be solved with the Hungarian algorithm. Formally, we seek an association matrix X of shape (M × K) whose elements are defined as:
X_ij = 1 if head i is assigned to face j, and X_ij = 0 otherwise.    (3.8)
This is solved by optimizing the formula below:

min_X Σ_{i=1}^{M} Σ_{j=1}^{K} C_ij X_ij    (3.9)

s.t. Σ_{i=1}^{M} X_ij = 1 ∀j,   Σ_{j=1}^{K} X_ij = 1 ∀i
where M is the number of detected heads, (x^i_1, y^i_1) are the coordinates of the top-left corner and (x^i_2, y^i_2) the coordinates of the bottom-right corner of the i-th head bounding box, with i = 1, ..., M; and K is the number of detected faces, with (p^j_1, q^j_1) and (p^j_2, q^j_2) the coordinates of the top-left and bottom-right corners of the j-th face bounding box, with j = 1, ..., K. We need to define the association cost C_ij, which stands for the opposite of the probability that the i-th head is paired with the j-th face.
3.4.2 Head-face pairing cost
Given an arbitrary pair of bounding boxes, the i-th head [(x^i_1, y^i_1), (x^i_2, y^i_2)] and the j-th face [(p^j_1, q^j_1), (p^j_2, q^j_2)], taken from the two detection outputs F_head and F_face respectively, we calculate the IoU of head i and face j. After iterating over all bounding boxes and calculating the corresponding IoU for each pair, a cost matrix C is returned. The cost function is defined in Eq. (3.10): the higher the IoU, the lower the cost (higher probability of a true match), and vice versa.
C_ij = 1 − IoU_ij    (3.10)
Once we solve the assignment problem using the Hungarian algorithm, we perform a post-filtering step, recalculating the IoU to confirm the correct assignments. As the vanilla linear sum assignment problem places no constraint on the output, we further suppress possible false assignments by setting X_ij = 0 if IoU_ij ≤ σ, where σ is a threshold. We set σ = 0 in all experiments for maximum tolerance (only filtering out matches with IoU_ij = 0). The unassociated face boxes can further be interpreted as restored head boxes, improving the recall rate.
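A minimal sketch of this pairing step using SciPy's `linear_sum_assignment` (a Hungarian-style solver); boxes are assumed to be in (x1, y1, x2, y2) corner format, and the helper names are illustrative, not the thesis code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def pair_heads_faces(heads, faces, sigma=0.0):
    """Pair head and face boxes by minimising C_ij = 1 - IoU_ij (Eq. 3.10)."""
    cost = np.array([[1.0 - iou(h, f) for f in faces] for h in heads])
    rows, cols = linear_sum_assignment(cost)  # Hungarian-style solver
    # Post-filtering: drop assignments whose IoU does not exceed sigma.
    return [(int(i), int(j)) for i, j in zip(rows, cols)
            if 1.0 - cost[i, j] > sigma]
```

Unmatched heads, unmatched faces, and confirmed pairs can then each be counted once toward the total number of people.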
3.5 Person tracking
Object detection is a problem with many challenges, the most difficult of which is
the problem of mutual occlusion. In this proposal, we try to be able to keep high
true positive rate using the object tracking method. In this section, the objects to be
tracked are head, coming from the head detector or from the output of combination
of head and face detectors as shown in Fig.3.12. In both cases, we use the SORT
algorithm to track objects in consecutive frames.
Figure 3.12: Flowchart of combining object detection and tracking to improve the true
positive rate.
The state vector in the tracking algorithm is defined by Eq. (3.11):

x = [x_cen, y_cen, s, r, ẋ, ẏ, ṡ]^T    (3.11)
(3.11)
Where:
x_cen, y_cen are the coordinates of the head center.
s is the area of the bounding box.
r is the aspect ratio of the bounding box.
ẋ, ẏ, ṡ are the respective velocities of x_cen, y_cen, and s.
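For illustration, converting a detected corner-format head box into the observed part of this state vector could look like the following sketch (the function name is an assumption, not the original SORT code):

```python
def bbox_to_state(x1, y1, x2, y2):
    """Convert a corner-format box to the SORT observation [x_cen, y_cen, s, r]."""
    w, h = x2 - x1, y2 - y1
    x_cen, y_cen = x1 + w / 2.0, y1 + h / 2.0
    s = w * h          # area of the bounding box
    r = w / float(h)   # aspect ratio width/height
    return [x_cen, y_cen, s, r]
```

The three velocity components are not observed directly; they are estimated by the Kalman filter from successive observations.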
The transition model of the state vector is represented as follows:

x_k = x_{k−1} + ẋ_{k−1} · dt
y_k = y_{k−1} + ẏ_{k−1} · dt
s_k = s_{k−1} + ṡ_{k−1} · dt
r_k = r_{k−1}
ẋ_k = ẋ_{k−1}
ẏ_k = ẏ_{k−1}
ṡ_k = ṡ_{k−1}    (3.12)
This can be written compactly as:

x_k = F_k x_{k−1} + w_k    (3.13)
where w
k
is the process noise, which is assumed to be drawn from a zero mean
multivariate normal distribution.
[x_k, y_k, s_k, r_k, ẋ_k, ẏ_k, ṡ_k]^T =
[ 1 0 0 0 1 0 0
  0 1 0 0 0 1 0
  0 0 1 0 0 0 1
  0 0 0 1 0 0 0
  0 0 0 0 1 0 0
  0 0 0 0 0 1 0
  0 0 0 0 0 0 1 ] · [x_{k−1}, y_{k−1}, s_{k−1}, r_{k−1}, ẋ_{k−1}, ẏ_{k−1}, ṡ_{k−1}]^T + w_k    (3.14)
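The prediction step of Eq. (3.14) can be sketched in NumPy (with dt = 1, as in SORT; this shows only the state propagation and omits the covariance update of a full Kalman filter):

```python
import numpy as np

# Constant-velocity transition matrix F from Eq. (3.14), with dt = 1.
F = np.array([
    [1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1],
], dtype=float)

def predict(x):
    """Propagate state [x_cen, y_cen, s, r, dx, dy, ds] one frame ahead."""
    return F @ x
```

Each position component advances by its velocity, while r and the velocities themselves stay constant, exactly as in Eq. (3.12).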
At time k, an observation (or measurement) z_k of the true state x_k is made according to:

z_k = H_k x_k + v_k    (3.15)
where v_k is the observation noise, which is assumed to be zero-mean Gaussian white noise with a known covariance.
First, from the head detection result, we obtain the head coordinates in the image, namely the top-left (x1, y1) and bottom-right (x2, y2) corners. These are then fed into the object tracking process using the SORT algorithm. The processing steps of SORT are:
Step 1: Detection: The first step is to detect objects in each frame of a video
using a computer vision algorithm such as a neural network-based detector.
Step 2: Association: The next step is to associate the detected objects with
previously tracked objects from previous frames. This is done by comparing the
features of the detected objects with those of the existing tracked objects and
assigning a similarity score.
Step 3: Prediction: Once the association is made, the algorithm predicts the
position of the objects in the next frame using a Kalman filter or another motion
model.
Step 4: Update: In this step, the tracked objects are updated with the new information from the current frame, such as the position and size of the objects.
Step 5: Track management: The final step involves managing the tracked objects,
such as removing objects that are no longer in the frame or creating new tracks
for newly detected objects.
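The five steps above can be summarised in a heavily simplified, self-contained sketch: greedy IoU association stands in for the Hungarian matching, the Kalman prediction is omitted, and all names (`Track`, `sort_step`, `MAX_AGE`) are illustrative, not the actual SORT implementation:

```python
MAX_AGE = 5  # frames a track may stay unmatched before removal (illustrative)

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

class Track:
    """A tracked head: last box plus a staleness counter (no Kalman state)."""
    def __init__(self, box):
        self.box, self.time_since_update = box, 0
    def update(self, box):
        self.box, self.time_since_update = box, 0

def sort_step(tracks, detections, iou_threshold=0.3):
    """One simplified SORT iteration: associate by IoU, update, manage tracks."""
    for t in tracks:
        t.time_since_update += 1                 # ages unless matched below
    unmatched = list(range(len(detections)))
    for t in tracks:                             # Step 2: greedy association
        if not unmatched:
            break
        best = max(unmatched, key=lambda j: iou(t.box, detections[j]))
        if iou(t.box, detections[best]) >= iou_threshold:
            t.update(detections[best])           # Step 4: update matched track
            unmatched.remove(best)
    tracks += [Track(detections[j]) for j in unmatched]   # Step 5: new tracks
    return [t for t in tracks if t.time_since_update < MAX_AGE]  # Step 5: prune
```

The per-frame people count is then simply the number of live tracks returned by `sort_step`.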
3.6 Conclusion
In this chapter, we present technical details about the proposed method, including de-
tection techniques, tracking techniques, and a strategy to combine multiple detections.
Overall, our proposed framework consists of four main modules: Head Detection, Head
Tracking, Multi Detection (MultiDetect) and Multi Detection with Tracking (Multi-
Detect with Track).
Chapter 4
Experiments
In Chapter 3, we presented our proposed method for the people counting problem. In
this chapter, we describe the datasets, the protocol of evaluation, and the experimental
results. We first compare people counting based on single head and face detection with the combined face and head detection (MultiDetect). We then compare MultiDetect with MultiDetect with Track. Qualitative analysis is also provided.
4.1 Dataset and Evaluation Metrics
We evaluate our proposed method on three benchmark datasets (Hollywood, Casablanca,
Wider Face) and one self-collected dataset. In the following, we will present detailed
information of each dataset.
4.1.1 Our collected dataset: ClassHead
As aforementioned, this thesis is within the context of the MOET project number CT2020.02.BKA.02. This project aims to build a system that observes the classroom with multiple cameras, processes the video data in order to count and locate the students, and analyzes the students' activities in class. As a consequence, we have collected a dataset for this purpose.
Equipment setup and Data collection The dataset was collected in two places: the seminar room on the 9th floor of building B1 of MICA Institute, and a classroom at ThuyLoi University. The seminar room's layout resembles that of a small classroom, capable of accommodating around 50 students (the room's area is about 100 m², with a height of 3.5 m). The room has large windows, and the lighting system consists of 60-cm neon bulbs. We collected 20 lessons with four different classes and obtained 20 raw videos. To observe the students and their activities, we installed multiple cameras at different locations. The camera parameters were set to fixed values for all collection sessions, as presented in Tab. 4.1. Data was collected over several lessons with various subjects and numerous students to ensure the practicality and reliability of the experiment. Fig. 4.1 depicts the layout of the room and camera locations, as well as the environment and the images collected from the cameras. In addition, we conducted a survey in the classroom at ThuyLoi University: the cameras were installed 3 m above the ground and data was collected in 5 classrooms, each accommodating up to 30 people; for each of these classrooms, we collected a few short videos for the object tracking problem.
Table 4.1: Camera parameters for data collection.
Parameter            Value
Image resolution     1920×1080
Sampling rate        25 fps
Bitrate type         Fixed
Video encoding type  H.264
Data annotation After collecting the raw video sequences, we needed to annotate the data for the people counting subtask. To do this, we extracted frames from the videos taken from five cameras, sampling at a frame rate of 25 fps. Next, we used the annotation tool LabelMe (http://labelme.csail.mit.edu/Release3.0/) to annotate the data. The annotations consist of bounding boxes drawn around people's heads, with the label 'Head' assigned to each box in every frame. Fig. 4.2 illustrates the LabelMe tool and an annotation
Figure 4.1: Camera layout in the simulated classroom and an image obtained from
each camera view.
for an image.
Figure 4.2: Illustration of LabelMe interface and main operations to annotate an image.
Our collected dataset is divided into two parts. Part 1 was collected at the MICA research institute, and Part 2 was collected at ThuyLoi University.
4.1.1.1 ClassHead Part 1
Tab. 4.2 summarizes the number of images and heads for training and testing in our collected dataset. The Part 1 dataset consists of 400,121 heads in 14,017 images. The training part consists of 7,943 images with 249,822 annotated heads; the testing part consists of 6,074 images with 150,299 annotated heads. We utilize ClassHead Part 1 for fine-tuning Yolov5 and testing head detection. Fig. 4.3 shows images taken from the five cameras. We observe that some people are clearly visible from one camera but may be occluded in other camera views.
Figure 4.3: Illustration of images taken from five camera view in ClassHead Part 1
dataset: (a) View 1 , (b) View 2, (c) View 3, (d) View 4 and (e) View 5.
Table 4.2: ClassHead Part 1 dataset for training and testing the Yolov5 head detector.
Data   Number   Cam 1   Cam 2   Cam 3   Cam 4   Cam 5
Train  Images    2476    1587    1397    1265    1218
       Heads    80187   52346   36978   41015   39296
Test   Images    1209    1231    1170    1200    1264
       Heads    31376   31562   24768   30235   32358
Table 4.3: ClassHead Part 2 dataset
Dataset Ch03 Ch04 Ch05 Ch12 Ch13
Number of images 100 100 100 100 100
Number of objects 1500 2400 2500 2571 2500
4.1.1.2 ClassHead Part 2
The Part 2 dataset was collected at ThuyLoi University and includes 500 images from cameras at five views, shown in Fig. 4.4, taken from different viewing angles at a resolution of 1920×1080 pixels (see the last row of Fig. 4.5); the number of images per view is given in Tab. 4.3. We utilize ClassHead Part 2 to compare the head detection method with the combined head and face method (MultiDetect) and with multiple detection with tracking (MultiDetect with Track).
4.1.2 Hollywood Heads dataset
Hollywood Heads (https://www.di.ens.fr/willow/research/headdetection/) is a large-scale dataset [27] derived from 21 Hollywood movies. This dataset is particularly designed for head detection in natural scenes, where people appear under a full variation of camera viewpoints, human poses, lighting conditions, and occlusions. As the dataset is collected from movies, the number of people in a scene is not very high, but the poses of humans and cameras and the lighting conditions can vary strongly. Fig. 4.5 (first row) illustrates some frames extracted from the dataset. We follow the same data split as the original paper [27] to train and test our algorithms. Specifically, we use 216,719 frames from 15 movies for training, 6,719 frames from 3 movies for validation, and 1,302 frames from 3 movies for testing. In the experimental part of the proposed methods, we compare the single
Figure 4.4: Some example images of ClassHead Part 2 dataset: view ch03 (a), view
ch04 (b), view ch05 (c), and view ch12 (d) and view ch13 (e).
detector with Multi Detector on the Hollywood Heads dataset.
4.1.3 Casablanca dataset
The Casablanca dataset (http://www.di.ens.fr/willow/research/headdetection/) contains frames extracted from the movie Casablanca [28]; the frames are grayscale and were taken under poor lighting and in crowded scenes. It comprises 1,466 frames with annotated head bounding boxes. Fig. 4.5 (second row) illustrates some frames extracted from the Casablanca dataset. The Casablanca dataset is annotated like the Hollywood dataset, except that the frontal head annotation has been reduced to faces. In the experiments for the proposed methods, we use the Casablanca dataset for the head detection and Multi Detector problems.
4.1.4 Wider Face dataset
The Wider Face dataset (http://shuoyang1213.me/WIDERFACE/) is a large-scale face detection dataset [29]. The dataset contains
over 32,000 images with 393,703 annotated faces. The dataset has a diverse range
of images, with varying lighting conditions and backgrounds. The dataset has been
split into training, validation and testing sets with a ratio 5:1:4. Fig. 4.5(third row)
illustrates some frames extracted from the Wider Face dataset. The annotations are
provided in a standard format with bounding box coordinates and a visibility score for
each face. The dataset has been widely used in research on face detection and recog-
nition, and it includes evaluation scripts for measuring performance using standard
metrics.
4.1.5 Evaluation metrics
To evaluate the people-counting algorithm, we compare the number of people detected
automatically by the algorithm with the actual number of people present in the frame
as determined by human observation. Automatic people counting could be determined
Figure 4.5: Some example images of Hollywood Heads dataset (first row), Casablanca
dataset (second row), Wider Face dataset (third row), and ClassHead Part 2 of our
dataset (last row).
58
4.1 Dataset and Evaluation Metrics
by Head detector, Face detector, Combined Head and Face (MultiDetect), or Combined
Head and Face with Track (MultiDetect with Track). Manual people counting is based on the ground truth given in the dataset: heads (in the Hollywood, Casablanca, and ClassHead datasets) or faces (in the Wider Face dataset).
A detection result generated from any of the above detectors is considered a true
positive based on the IoU metric. We then compute the Precision and Recall of the
detectors to compare their performance on each frame.
4.1.5.1 Intersection over Union (IoU)
The Intersection over Union (IoU) is computed as the ratio of the overlapping area to the total area covered by two spatial regions containing the object, one detected automatically and one annotated manually. Fig. 4.6 illustrates how IoU is computed. The IoU value ranges from 0 to 1: IoU = 0 when the detected region and the ground-truth region are completely separate, and IoU = 1 when the two regions overlap exactly. The IoU threshold is often chosen as 0.5, and a detection is kept or discarded depending on this threshold: a detection is a true positive if IoU >= 0.5.
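A direct computation of IoU from corner coordinates (a generic sketch, not tied to any particular library):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) corner format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

For example, two identical boxes give IoU = 1.0, while two disjoint boxes give IoU = 0.0.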
Figure 4.6: Calculating IoU [30].
4.1.5.2 Precision and Recall
To evaluate the performance of detection algorithms, we use two common metrics
called Precision and Recall. Fig.4.7 shows how to calculate Precision and Recall with a
binary classifier. Precision is defined as the ratio of correct positive predictions (True Positives) to all positive predictions (True Positives + False Positives). Meanwhile, Recall (sensitivity) is the ratio of correct positive predictions (True Positives) to all actual positives (Ground Truth = True Positives + False Negatives).
True Positive (TP): the number of positive predictions that are correct.
False Positive (FP): the number of positive predictions that are incorrect.
False Negative (FN): the number of actual positives incorrectly predicted as negative (missed).
True Negative (TN): the number of negative predictions that are correct.
Figure 4.7: Precision and Recall metrics [30].
4.1.5.3 F1-score
On the basis of Precision and Recall, F1-score, also known as the median Harmonic
mean, is defined as the ratio between the total number of correct findings and the mean
of the automatic findings. F1-score is a more objective representation of performance.
This value is in the range (0, 1]. The formula (4.1) represents the equation of F
1
score.
60
4.1 Dataset and Evaluation Metrics
F
1
score = 2×
Precision × Recall
Precision + Recall
(4.1)
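These three metrics follow directly from the TP/FP/FN counts; for example:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and F1-score from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4.1)
    return precision, recall, f1
```

With 8 true positives, 2 false positives, and 2 false negatives, all three metrics equal 0.8.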
4.1.5.4 AP and mAP
Average precision (AP) is another commonly used metric. The AP at an IoU threshold α is determined as follows:

AP@α = ∫₀¹ p(r) dr    (4.2)

where AP is the area under the curve generated by the Precision and Recall values. The larger the AP, the more effective the method. Many studies report AP50 and AP75, the AP values at IoU thresholds of 50% and 75%. For the multi-class detection problem, the mean AP (mAP) over the classes is calculated as follows:

mAP@α = (1/n) Σ_{i=1}^{n} AP_i    for n classes.    (4.3)
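Eq. (4.2) can be approximated numerically from sampled (recall, precision) points. This generic trapezoidal sketch is an illustration only, not the exact interpolation protocol of PASCAL VOC or COCO:

```python
def average_precision(recalls, precisions):
    """Approximate AP = integral of p(r) dr by trapezoidal integration.

    `recalls` must be sorted in increasing order; both lists have equal length.
    """
    ap = 0.0
    for k in range(1, len(recalls)):
        ap += (recalls[k] - recalls[k - 1]) * (precisions[k] + precisions[k - 1]) / 2
    return ap
```

A perfect detector (precision 1.0 at every recall level up to 1.0) yields AP = 1.0.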
4.1.5.5 Mean Absolute Error
The above measures evaluate the effectiveness of the head and face detectors. Within the framework of this master's thesis on the people counting problem, we use an additional measure, the Mean Absolute Error (MAE), which represents the average people-counting error per frame. Therefore, the smaller the MAE, the better the counting result. The formula of MAE is:

MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|    (4.4)

Where:
N is the number of frames.
y_i is the actual number of heads (faces) in the i-th frame (ground truth).
ŷ_i is the number of detected heads (faces) in the i-th frame (detection and/or tracking result).
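Eq. (4.4) translates directly to code; a small sketch:

```python
def mae(ground_truth_counts, predicted_counts):
    """Mean absolute per-frame counting error, Eq. (4.4)."""
    n = len(ground_truth_counts)
    return sum(abs(y - y_hat)
               for y, y_hat in zip(ground_truth_counts, predicted_counts)) / n
```

For instance, ground-truth counts [5, 7, 6] against predictions [5, 8, 4] give an MAE of (0 + 1 + 2) / 3 = 1.0.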
4.2 Experimental Results
4.2.1 Evaluation on Hollywood dataset
Hollywood Heads is a head-only-annotated dataset. We compare the performance of a single head detector (Yolov5 fine-tuned on the ClassHead Part 1 dataset), a single face detector (pre-trained RetinaFace), and our combination method (MultiDetect), which combines face and head detections using the Hungarian algorithm. The MultiDetect with Track method is not evaluated because tracking ground truth is unavailable for this dataset.
Tab. 4.4 shows the Precision, Recall, F1-score, and AP of head detection, which are 88.32%, 79.65%, 83.76%, and 77.49%, respectively. When we apply a face detector, we can retrieve additional heads that are missed by the head detector. Note that a detected face is considered a true positive if the IoU of the face region and the ground-truth head is greater than 0.9. We detect 54 additional faces corresponding to 54 missed heads, but the byproduct is 10 more false positives. In the face–head combined scheme, we obtain a higher recall rate (81.69%) and AP (79.45%), as shown in Tab. 4.4.
Some illustrative results of head and face detection are shown in Fig. 4.9. Correctly detected heads, missed heads (false negatives), and correctly detected faces are bounded by green, red, and yellow boxes, respectively. Due to the non-frontal orientations in Hollywood Heads, neither the head nor the face detector can ensure perfect results. As shown in Fig. 4.9a and Fig. 4.9b, the red rectangles are missed heads or faces. As a result, people counting based on head or face detection alone would be unreliable. With the pairing technique, the number of both paired and unpaired boxes is significantly more precise (Fig. 4.9c).
In addition, we evaluate the MAE between the model predictions and the ground truth. The results show that the MultiDetect method achieves better results, with an MAE of 0.514, a decrease of 0.016 compared to the Head Detection method, as shown in Fig. 4.8 below:
Figure 4.8: MAE measurement results on 2 proposed methods in Hollywood Heads
dataset.
Table 4.4: Results of the proposed method on the Hollywood Heads dataset.
Metrics        Head Detection   MultiDetect
Precision (%)  88.32            88.22
Recall (%)     79.65            81.69
F1-score (%)   83.76            84.83
AP (%)         77.49            79.45
MAE            0.53             0.514
4.2.2 Evaluation on Casablanca dataset
Similarly, on the Casablanca dataset, with a supplementary face detector, we recover 42
additional faces. Consequently, Recall value increases from 92.11% to 93.24%, and AP
increases from 88.36% to 89.34% as shown details in Tab.4.5. Besides, we illustrative
results of head and face detection are shown in Fig. 4.11. Correct detected heads,
missed heads (false negatives) and correct detected faces are bounded by green, red and
Figure 4.9: Results of Hollywood Heads dataset. (a) Results of head detection; (b)
Results of face detection; (c) Matching head and face detection using the Hungarian
algorithm. Heads are denoted with green, faces are yellow, missed ground truths are
red, and head-face pairings are cyan.
yellow boxes, respectively. Due to the heavy occlusion in Casablanca, neither the head nor the face detector can ensure perfect results. As shown in Fig. 4.11a and Fig. 4.11b, the red rectangles are missed heads or faces. As a result, people counting based on head or face detection alone would be unreliable. With the pairing technique, the number of both paired and unpaired boxes is significantly more precise (Fig. 4.11c).
In addition, we evaluate the MAE between the model predictions and the ground truth. Here, the Head Detection method achieves the better result, with an MAE of 0.633, which is 0.069 lower than that of the MultiDetect method, as shown in Fig. 4.10 below.
Table 4.5: Results of the proposed method on the Casablanca dataset.
Metrics        Head Detection   MultiDetect
Precision (%)  79.74            79.38
Recall (%)     92.11            93.24
F1-score (%)   85.48            85.75
AP (%)         88.36            89.34
MAE            0.633            0.702
Figure 4.10: MAE measurement results on 2 proposed methods in Casablanca Heads
dataset.
Figure 4.11: Results of Casablanca dataset. (a) Results of head detection; (b) Results
of face detection; (c) Matching head and face detection using the Hungarian algorithm.
Heads are denoted with green, faces are yellow, missed ground truths are red, and head-
face pairings are cyan.
4.2.3 Evaluation on Wider Face dataset
On the Wider Face dataset, which is face-only-annotated, we consider the ground-truth faces as heads. We achieve only 74.89%, 71.61%, 73.21%, and 64.43% of precision, recall, F1, and AP, respectively. We obtain 2,222 additional faces but also 617 false positives. The resulting scores increase significantly to 75.12%, 77.2%, 76.15%, and 69.23%, as
shown in Tab. 4.6. Additionally, some results of head and face detection are shown in Fig. 4.12. Correctly detected heads, missed heads (false negatives), and correctly detected faces are bounded by green, red, and yellow boxes, respectively. Due to the crowded scenes in the Wider Face dataset, neither the head nor the face detector can ensure perfect results. As shown in Fig. 4.12a and Fig. 4.12b, the red rectangles are missed heads or faces. As a result, people counting based on head or face detection alone would be unreliable. With the pairing technique, the number of both paired and unpaired boxes is significantly more precise (Fig. 4.12c).
Table 4.6: Results of the proposed method on the Wider Face dataset.
Metrics        Face Detection   MultiDetect
Precision (%)  95.26            75.12
Recall (%)     62.58            77.2
F1-score (%)   75.54            76.15
AP (%)         62.07            69.23
4.2.4 Evaluation on ClassHead Part 2 dataset
Head Detection
First, we evaluate the head detection method on the ClassHead Part 2 dataset. As shown in Tab. 4.7, this method works best on view ch04, with a Precision of 95.51%; the Recall measure also achieves its best result on view ch04, with a value of 96.58%. View ch03 performs worst because its camera is placed at an angle with a narrow field of view, while the other cameras are placed in central positions or at wide viewing angles.
MultiDetect
Figure 4.12: Results of Wider Face dataset. (a) Results of head detection; (b) Results
of face detection; (c) Matching head and face detection using the Hungarian algorithm.
Heads are denoted with green, faces are yellow, missed ground truths are red, and head-
face pairings are cyan.
Table 4.7: Results of the head detection method on the ClassHead Part 2 dataset.
View           ch03   ch04   ch05   ch12   ch13
Precision (%)  71.43  95.51  91.64  88.97  95.08
Recall (%)     76     96.58  89.4   95.41  94.4
F1-score (%)   73.64  96.04  90.51  92.08  94.74
AP (%)         66.22  95.58  88.62  87.65  93.97
The results of the combined head and face (MultiDetect) method are detailed for the five views ch03 to ch13. Tab. 4.8 shows that this method increases Precision by 9.71%, 3.01%, 0.83%, and 0.46% in views ch03, ch04, ch05, and ch12, respectively, compared with the head detection method. Likewise, Recall increases by 12.6%, 3.09%, 1.04%, 1.05%, and 1.68% for views ch03, ch04, ch05, ch12, and ch13, respectively.
In addition, we illustrate the results of the method in Fig.4.13. In Fig.4.13(a), the
green bounding box shows the results of head detections using Yolov5. In addition,
in Fig.4.13 (b), the yellow bounding boxes represent the results of face detections.
Finally, Fig.4.13 (c) is the combined result of head detections and face detections,
Table 4.8: Results of the method of the MultiDectect in ClassHead Part 2 dataset
View ch03 ch04 ch05 ch12 ch13
Precision(%) 81.14 98.52 92.47 89.43 92.85
Recall(%) 88.6 99.67 90.44 96.46 96.08
F1-score(%) 84.71 99.09 91.44 92.81 94.44
AP(%) 81.91 99.56 89.9 88.73 95.59
Figure 4.13: MultiDetect results on the ClassHead Part 2 dataset. (a) Head detections;
(b) face detections; (c) MultiDetect.
which includes both the green and the yellow bounding boxes. We treat the additional
face detections as heads that the Yolov5 detector missed.
Head Tracking
We evaluate this part on our self-collected ClassHead Part 2 dataset, because SORT
tracks objects across many consecutive frames. On the other datasets, the frames are
discrete, so this tracking approach cannot be applied.
Object tracking is a widely used technique, and we apply it to this people counting
problem. As illustrated in Tab. 4.9, the method achieves quite positive results.
This method increases the precision
Figure 4.14: Head tracking results on the ClassHead Part 2 dataset. (a) Head
detections at frame 1; (b) head tracking at frame 100.
at views ch03, ch04, and ch05 by 9.49%, 2.689%, and 0.219%, respectively, when
compared with the head detection method. Also, with this method, Recall increases by
12.8%, 3.13%, 1.2%, 1.28%, and 1.28% at views ch03, ch04, ch05, ch12, and ch13,
respectively, compared with the head detection method.
We also illustrate the tracking process in Fig. 4.14. In Fig. 4.14(a), the green
bounding boxes are the initial detections obtained from Yolov5. Fig. 4.14(b) shows the
tracked objects as cyan bounding boxes alongside the green bounding boxes obtained
from object detection.
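For illustration, the association step of such a tracker can be sketched as follows. This is a greatly simplified, hypothetical stand-in for SORT: IoU-only matching with no Kalman motion model or track deletion, and `iou_thresh=0.3` is an assumed value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)


class SimpleTracker:
    """IoU-based association only; the real SORT adds a Kalman filter.

    Assigns a persistent id to each box across consecutive frames.
    """

    def __init__(self, iou_thresh=0.3):
        self.iou_thresh = iou_thresh
        self.tracks = {}   # track id -> last seen box
        self.next_id = 0

    def update(self, boxes):
        """Return [(track_id, box), ...] for the current frame."""
        ids = list(self.tracks)
        assigned = {}
        if ids and boxes:
            # Hungarian assignment on 1 - IoU between tracks and detections.
            cost = np.array([[1.0 - iou(self.tracks[i], b) for b in boxes]
                             for i in ids])
            rows, cols = linear_sum_assignment(cost)
            for r, c in zip(rows, cols):
                if 1.0 - cost[r, c] >= self.iou_thresh:
                    assigned[int(c)] = ids[r]
        out = []
        for j, b in enumerate(boxes):
            tid = assigned.get(j)
            if tid is None:          # unmatched detection starts a new track
                tid = self.next_id
                self.next_id += 1
            self.tracks[tid] = b
            out.append((tid, b))
        return out
```

Because tracks persist between updates, a person whose box is briefly missed can be re-associated when detection recovers, which is the effect exploited in this section.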
Table 4.9: Results of the head tracking method on the ClassHead Part 2 dataset.

View           ch03    ch04    ch05    ch12    ch13
Precision(%)   80.92   98.19   91.85   88.60   94.21
Recall(%)      88.80   99.71   90.60   96.69   95.68
F1-score(%)    84.68   98.94   91.22   92.47   94.94
AP(%)          81.80   99.56   89.86   88.69   94.56
MultiDetect with Track
Because of the results in the previous step, we have shown that the method of
combining head and face (MultiDetect) is very promising. Therefore, we combined
the results from the previous section 4.2.2 in MultiDetect with Track method. The
results are specifically illustrated in Tab. 4.10. More specifically, by MultiDetect with
Track method, Recall value at view ch03 is up 5.1%, 0.08%, 1.24%, 1.21%, 2.2% when
compared with the MultiDetect method.
Figure 4.15: MultiDetect with Track results on the ClassHead Part 2 dataset. (a)
MultiDetect with Track at frame 1; (b) MultiDetect with Track at frame 100.
In addition, we illustrate the results of the MultiDetect with Track method in
Fig. 4.15. In Fig. 4.15(a), the green bounding boxes are the initial detections
obtained from Yolov5. Fig. 4.15(b) shows the tracked objects as cyan bounding boxes
alongside the green and yellow bounding boxes obtained from the MultiDetect method.
Table 4.10: Results of the MultiDetect with Track method on the ClassHead Part 2
dataset.

View           ch03     ch04     ch05    ch12     ch13
Precision(%)   69.55    92.899   85.49   82.653   82.505
Recall(%)      93.667   99.75    91.68   97.666   98.28
F1-score(%)    79.83    96.20    88.48   89.53    89.70
AP(%)          74.73    98.67    86.64   85.89    89.54
To compare with the previous methods, we aggregate the results in Tab. 4.11.
According to Tab. 4.11, Precision is highest for the MultiDetect method at 90.88%,
while Recall is highest for the MultiDetect with Track method at 96.21%. In addition,
we evaluate the MAE between the model predictions and the ground truth; the results
are shown in Fig. 4.16.
Figure 4.16: MAE results of the three proposed methods on the ClassHead Part 2
dataset.
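The MAE used here compares the predicted per-frame people count with the ground-truth count. A minimal sketch (the counts below are illustrative, not the thesis data):

```python
def mean_absolute_error(predicted, ground_truth):
    """MAE between per-frame predicted and ground-truth people counts."""
    assert len(predicted) == len(ground_truth) > 0
    return sum(abs(p - g)
               for p, g in zip(predicted, ground_truth)) / len(predicted)


# e.g. three frames whose ground-truth count is 30 people each
print(mean_absolute_error([28, 30, 29], [30, 30, 30]))  # 1.0
```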
Table 4.11: Average experimental results on the ClassHead Part 2 dataset.

Metrics            Head Detection   MultiDetect   MultiDetect with Track
Precision AVG(%)   88.53            90.75         82.62
Recall AVG(%)      90.36            94.29         96.21
Chapter 5
Conclusions
5.1 Conclusion
In this thesis, we attempt to improve people counting results based on these
observations. We first deploy two detectors (Yolo and Retina-Face) to detect the heads
and faces of people in the scene. We then develop a pairing technique that aligns the
face and the head of each person. This alignment helps to recover missed head or face
detections and thus increases the true positive rate. To overcome frames in which both
the face and the head are missed, we apply a tracking technique (i.e., SORT) to the
combined detection result. Putting all of these techniques into a unified framework
increases the true positive rate from 90.36% to 96.21% on the ClassHead Part 2
dataset.
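The framework summarized above can be sketched end to end with the detectors, pairing step, and tracker injected as callables. These interfaces are hypothetical placeholders; in the thesis they would be Yolov5, Retina-Face, the Hungarian pairing, and SORT.

```python
def count_people(frames, detect_heads, detect_faces, pair_boxes, track):
    """Per-frame people counts from paired head/face detections plus tracking.

    detect_heads / detect_faces: frame -> list of boxes
    pair_boxes: (head_boxes, face_boxes) -> one combined box per person
    track: combined boxes -> tracked boxes (can recover people missed by
           both detectors in the current frame using earlier frames)
    """
    counts = []
    for frame in frames:
        heads = detect_heads(frame)
        faces = detect_faces(frame)
        combined = pair_boxes(heads, faces)  # MultiDetect step
        tracked = track(combined)            # MultiDetect with Track step
        counts.append(len(tracked))
    return counts
```

With trivial stand-in callables, `count_people` simply counts the combined boxes per frame; the gains reported above come from the real detectors, pairing, and tracker filling in each other's misses.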
The proposed methodology was published as "Improvement of People Counting by Pairing
Head and Face Detections from Still Images" at the 2021 MAPR conference [31].
5.2 Future Works
People counting is the process of accurately measuring the number of people entering,
exiting, or passing through a specific area or location. The data generated by peo-
ple counting can be used to optimize operations, improve customer experience, ensure
public safety, and measure the effectiveness of marketing campaigns, among other ap-
plications. People counting is commonly used in retail stores, shopping malls, airports,
public transportation systems, museums, and other public spaces. In the near future,
we will continue to work on people counting and on methods of combining human body
parts. More specifically, we will study combining the head with the upper body of a
person, thereby building a complete end-to-end network that improves the efficiency
of people counting.
References
[1] G. Gao, J. Gao, Q. Liu, Q. Wang, and Y. Wang, “Cnn-based density estimation
and crowd counting: A survey,” arXiv preprint arXiv:2003.12783, 2020. vi, 5
[2] X. Zhao, E. Delleandrea, and L. Chen, “A people counting system based on face
detection and tracking in a video,” in 2009 Sixth IEEE International Conference
on Advanced Video and Signal Based Surveillance, pp. 67–72, IEEE, 2009. vi, 10
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified,
real-time object detection,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 779–788, 2016. vi, 31, 34
[4] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and
accuracy of object detection,” 2020. vi, vii, 33, 35, 37, 38, 39
[5] T.-Y. Chen, C.-H. Chen, D.-J. Wang, and Y.-L. Kuo, “A people counting system
based on face-detection,” in 2010 Fourth International Conference on Genetic and
Evolutionary Computing, pp. 699–702, IEEE, 2010. 10
[6] G. Zhao, H. Liu, L. Yu, B. Wang, and F. Sun, “Depth-assisted face detection and
association for people counting.,” in CCPR, pp. 251–258, 2012. 10, 11
[7] B. Li, J. Zhang, Z. Zhang, and Y. Xu, “A people counting method based on head
detection and tracking,” in 2014 International Conference on Smart Computing,
pp. 136–141, IEEE, 2014. 11, 12
[8] S. D. Khan, H. Ullah, M. Ullah, N. Conci, F. A. Cheikh, and A. Beghdadi, “Per-
son head detection based deep model for people counting in sports videos,” in
2019 16th IEEE International Conference on Advanced Video and Signal Based
Surveillance (AVSS), pp. 1–8, IEEE, 2019. 12
[9] K. Zhang, F. Xiong, P. Sun, L. Hu, B. Li, and G. Yu, “Double anchor r-cnn for
human detection in a crowd,” arXiv preprint arXiv:1909.09998, 2019. 13
[10] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou, “Relational learning
for joint head and human detection,” in Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 34, pp. 10647–10654, 2020. 13, 14
[11] P. Karpagavalli and A. Ramprasad, “Estimating the density of the people and
counting the number of people in a crowd environment for human safety,” in 2013
International Conference on Communication and Signal Processing, pp. 663–667,
IEEE, 2013. 15
[12] V. Lempitsky and A. Zisserman, “Learning to count objects in images,” Advances
in neural information processing systems, vol. 23, 2010. 15, 16
[13] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and
X. Wang, “Bytetrack: Multi-object tracking by associating every detection box,”
in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel,
October 23–27, 2022, Proceedings, Part XXII, pp. 1–21, Springer, 2022. 19
[14] R. E. Kalman, “A new approach to linear filtering and prediction problems,”
Transactions of the ASME–Journal of Basic Engineering, vol. 82, no. Series D,
pp. 35–45, 1960. 19
[15] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime
tracking,” CoRR, vol. abs/1602.00763, 2016. 22
[16] https://en.oi-wiki.org/graph/graph-matching/graph-match/. 22, 23
[17] N. Wojke, A. Bewley, and D. Paulus, “Deep sort: simple online and realtime
tracking with a deep association metric,” arXiv preprint arXiv:1703.07402, 2017.
25
[18] S. Vogt, A. Khamene, F. Sauer, and H. Niemann, “Single camera tracking of
marker clusters: Multiparameter cluster optimization and experimental verifica-
tion,” in Proceedings. International Symposium on Mixed and Augmented Reality,
pp. 127–136, IEEE, 2002. 26
[19] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance mea-
sures and a data set for multi-target, multi-camera tracking,” in Computer Vision–
ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16,
2016, Proceedings, Part II, pp. 17–35, Springer, 2016. 26
[20] T. Schmidt, K. Hertkorn, R. Newcombe, Z. Marton, M. Suppa, and D. Fox,
“Depth-based tracking with physical constraints for robot manipulation,” in 2015
IEEE International Conference on Robotics and Automation (ICRA), pp. 119–126,
IEEE, 2015. 26
[21] M. Pervaiz, Y. Y. Ghadi, M. Gochoo, A. Jalal, S. Kamal, and D.-S. Kim, “A
smart surveillance system for people counting and tracking using particle flow and
modified som,” Sustainability, vol. 13, no. 10, p. 5367, 2021. 27
[22] A. Shehzed, A. Jalal, and K. Kim, “Multi-person tracking in smart surveillance
system for crowd counting and normal/abnormal events detection,” in 2019 inter-
national conference on applied and engineering mathematics (ICAEM), pp. 163–
168, IEEE, 2019. 27, 28
[23] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” 2017 IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525,
2017. 33
[24] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv
preprint arXiv:1804.02767, 2018. 33, 36
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convo-
lutional networks for visual recognition,” IEEE transactions on pattern analysis
and machine intelligence, vol. 37, no. 9, pp. 1904–1916, 2015. 36
[26] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou, “Retinaface: Single-
stage dense face localisation in the wild,” arXiv preprint arXiv:1905.00641, 2019.
40, 42
[27] T.-H. Vu, A. Osokin, and I. Laptev, “Context aware cnns for person head detec-
tion,” in Proceedings of the IEEE International Conference on Computer Vision,
pp. 2893–2901, 2015. 55
[28] X. Ren, “Finding people in archive films through tracking,” in 2008 IEEE Con-
ference on Computer Vision and Pattern Recognition, pp. 1–8, IEEE, 2008. 57
[29] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: A face detection bench-
mark,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 5525–5533, 2016. 57
[30] https://machinelearningcoban.com/2017/08/31/evaluation/. 59, 60
[31] T.-O. Ha, H.-N. Tran, H.-Q. Nguyen, T.-H. Tran, P.-D. Nguyen, H.-G. Doan,
V.-T. Nguyen, H. Vu, and T.-L. Le, “Improvement of people counting by pairing
head and face detections from still images,” in 2021 International Conference on
Multimedia Analysis and Pattern Recognition (MAPR), pp. 1–6, IEEE, 2021. 72
[32] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object de-
tection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 10781–10790, 2020.
[33] S. Seong, J. Song, D. Yoon, J. Kim, and J. Choi, “Determination of vehicle trajec-
tory through optimization of vehicle bounding boxes using a convolutional neural
network,” Sensors, vol. 19, no. 19, p. 4263, 2019.
[34] Y.-Q. Huang, J.-C. Zheng, S.-D. Sun, C.-F. Yang, and J. Liu, “Optimized yolov3
algorithm and its application in traffic flow detections,” Applied Sciences, vol. 10,
no. 9, p. 3079, 2020.
[35] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
July 2017.
[36] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time ob-
ject detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 39, p. 1137–1149, June 2017.
[37] D. Peng, Z. Sun, Z. Chen, Z. Cai, L. Xie, and L. Jin, “Detecting heads using
feature refine net and cascaded multi-scale architecture,” 2018 24th International
Conference on Pattern Recognition (ICPR), pp. 2528–2533, 2018.
[38] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully
convolutional networks,” in Advances in neural information processing systems,
pp. 379–387, 2016.
[39] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object
segmentation and fine-grained localization,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 447–456, 2015.
[40] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg,
“Ssd: Single shot multibox detector,” in European conference on computer vision,
pp. 21–37, Springer, 2016.
[41] T. Kong, A. Yao, Y. Chen, and F. Sun, “Hypernet: Towards accurate region pro-
posal generation and joint object detection,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 845–853, 2016.
[42] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,”
arXiv preprint arXiv:1506.04579, 2015.
[43] Z. Huang, J. Wang, X. Fu, T. Yu, Y. Guo, and R. Wang, “Dc-spp-yolo: Dense con-
nection and spatial pyramid pooling based yolo for object detection,” Information
Sciences, 2020.
[44] Y. Sun, X. Wang, and X. Tang, “Deeply learned face representations are sparse,
selective, and robust,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 2892–2900, 2015.
[45] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen, “Combining multi-
ple kernel methods on riemannian manifold for emotion recognition in the wild,”
in Proceedings of the 16th International Conference on multimodal interaction,
pp. 494–501, 2014.
[46] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, “Semantic segmentation
with second-order pooling,” in European Conference on Computer Vision, pp. 430–
443, Springer, 2012.
[47] E. U. Haq, H. Jianjun, K. Li, and H. U. Haq, “Human detection and tracking with
deep convolutional neural networks under the constrained of noise and occluded
scenes,” Multimedia Tools and Applications, vol. 79, no. 41, pp. 30685–30708,
2020.
[48] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. An-
dreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for
mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[49] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neu-
ral networks,” in International Conference on Machine Learning, pp. 6105–6114,
PMLR, 2019.
[50] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for
accurate object detection and semantic segmentation,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 580–587, 2014.
[51] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on
computer vision, pp. 1440–1448, 2015.
[52] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time ob-
ject detection with region proposal networks,” Advances in neural information
processing systems, vol. 28, pp. 91–99, 2015.
[53] T.-O. Ha, T. N. P. Truong, H.-Q. Nguyen, T.-H. Tran, T.-L. Le, H. Vu, and H.-G.
Doan, “Automatic student counting in images using deep learning techniques, ap-
plication in smart classroom management (in vietnamese),” in The 23rd National
Conference on Electronics, Communications and Information Technology-REV-
ECIT, pp. 142–146, 2020.
[54] S. Zhang, R. Zhu, X. Wang, H. Shi, T. Fu, S. Wang, T. Mei, and S. Z.
Li, “Improved selective refinement network for face detection,” arXiv preprint
arXiv:1901.06651, 2019.
[55] T. Teixeira and A. Savvides, “Lightweight people counting and localizing in in-
door spaces using camera sensor nodes,” in 2007 First ACM/IEEE International
Conference on Distributed Smart Cameras, pp. 36–43, IEEE, 2007.
[56] J. Luo, J. Wang, H. Xu, and H. Lu, “A real-time people counting approach in in-
door environment,” in International Conference on Multimedia Modeling, pp. 214–
223, Springer, 2015.
[57] T. Zhao and R. Nevatia, “Bayesian human segmentation in crowded situations,”
in 2003 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2003. Proceedings., vol. 2, pp. II–459, IEEE, 2003.
[58] E. Zhang and F. Chen, “A fast and robust people counting method in video
surveillance,” in 2007 International Conference on Computational Intelligence and
Security (CIS 2007), pp. 339–343, IEEE, 2007.
[59] T.-H. Vu, A. Osokin, and I. Laptev, “Context-aware cnns for person head detec-
tion,” in Proceedings of the IEEE International Conference on Computer Vision,
pp. 2893–2901, 2015.
[60] W. Wong, D. Q. Huynh, and M. Bennamoun, “Upper body detection in uncon-
strained still images,” in 2011 6th IEEE Conference on Industrial Electronics and
Applications, pp. 287–292, IEEE, 2011.
[61] C. Gao, P. Li, Y. Zhang, J. Liu, and L. Wang, “People counting based on head
detection combining adaboost and cnn in crowded surveillance environment,” Neu-
rocomputing, vol. 208, pp. 108–116, 2016.
[62] W. Liu, M. Salzmann, and P. Fua, “Counting people by estimating people flows,”
arXiv preprint arXiv:2012.00452, 2020.
[63] X. Shi, X. Li, C. Wu, S. Kong, J. Yang, and L. He, “A real-time deep network
for crowd counting,” in ICASSP 2020-2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 2328–2332, IEEE, 2020.
[64] B. Wei, M. Chen, Q. Wang, and X. Li, “Multi-channel deep supervision for crowd
counting,” arXiv preprint arXiv:2103.09553, 2021.
[65] D. Liang, X. Chen, W. Xu, Y. Zhou, and X. Bai, “Transcrowd: Weakly-supervised
crowd counting with transformer,” arXiv preprint arXiv:2104.09116, 2021.
[66] J. Liu, J. Liu, and M. Zhang, “A detection and tracking based method for real-
time people counting,” in 2013 Chinese Automation Congress, pp. 470–473, IEEE,
2013.
[67] W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, and T.-K. Kim, “Multiple object
tracking: A literature review,” Artificial intelligence, vol. 293, p. 103448, 2021.